Hypothesis

Postnatal environmental exposures, particularly those found in household products and dietary intake, along with specific serum metabolomics profiles, are significantly associated with the BMI Z-score of children aged 6-11 years. Higher concentrations of certain metabolites in serum, reflecting exposure to chemical classes or metals, will correlate with variations in BMI Z-score, controlling for age and other relevant covariates. Some metabolites associated with chemical exposures and dietary patterns can serve as biomarkers for the risk of developing obesity.

Background

Research indicates that postnatal exposure to endocrine-disrupting chemicals (EDCs) such as phthalates, bisphenol A (BPA), and polychlorinated biphenyls (PCBs) can significantly influence body weight and metabolic health (Junge et al., 2018). These chemicals, commonly found in household products and absorbed through dietary intake, can interfere with the hormonal systems that regulate growth and metabolism. This hormonal interference can lead to an increased body mass index (BMI) in children, suggesting a potential pathway through which exposure to these chemicals contributes to the development of obesity.

A longitudinal study of Japanese children examined the impact of postnatal exposure (first two years of life) to p,p’-dichlorodiphenyltrichloroethane (p,p’-DDT) and p,p’-dichlorodiphenyldichloroethylene (p,p’-DDE) through breastfeeding (Plouffe et al., 2020). The findings revealed that higher levels of these chemicals in breast milk were associated with increased BMI at 42 months of age. DDT and DDE may interfere with hormonal pathways related to growth and development, mimicking or disrupting the hormones that regulate metabolism and fat accumulation. This study highlights the importance of understanding how persistent organic pollutants can affect early childhood growth and development.

The study by Harley et al. (2013) investigated the association between prenatal and postnatal bisphenol A (BPA) exposure and body composition metrics in 9-year-old children from the CHAMACOS cohort. Higher prenatal BPA exposure was linked to lower BMI and body fat percentage in girls but not boys, suggesting sex-specific effects. Conversely, BPA levels measured at age 9 were positively associated with increased adiposity in both sexes, highlighting how the timing of exposure shapes its impact on childhood development.

The 2022 study by Uldbjerg et al. explored the effects of combined exposures to multiple EDCs, suggesting that mixtures of these chemicals can have additive or synergistic effects on BMI and obesity risk. Humans are typically exposed to mixtures of chemicals rather than individual EDCs, making it crucial to understand how these mixtures interact. The research highlighted that interactions between different EDCs can be additive (the effects simply add up) or even synergistic (the combined effect is greater than the sum of the separate effects). These interactions can significantly amplify the risk factors associated with obesity and metabolic disorders in children. Dose-response analyses showed that even low-level exposure to multiple EDCs can produce significant health impacts through their combined effects.

These studies collectively illustrate the critical role of environmental EDCs in shaping metabolic health outcomes in children, highlighting the necessity for ongoing research and policy intervention to mitigate these risks.

Data Description

This study utilizes data from a subcohort of 1,301 mother-child pairs in the HELIX study, restricted to children aged 6-11 years for whom complete exposure and outcome data were available. Exposure data include detailed postnatal dietary records and concentrations of various chemicals, such as BPA and PCBs, measured in child blood samples. The variables are both categorical and numerical, spanning demographic details and biochemical measurements. This dataset allows for robust statistical analysis to identify potential associations between EDC exposure and changes in BMI Z-scores, considering confounding factors such as age, sex, and socioeconomic status. There are no missing data, so no imputation was required. Child BMI Z-scores were calculated based on WHO growth standards.
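As a sketch of the planned analysis, the association of interest can be fit with a linear model regressing the BMI Z-score on a single exposure plus covariates. Variable names follow the HELIX codebook; the data below are simulated stand-ins (not the actual HELIX records) so the snippet runs on its own:

```r
# Sketch of the planned association model: zBMI ~ exposure + covariates.
# Variable names follow the HELIX codebook; data are simulated here so
# the example is self-contained.
set.seed(1)
n <- 1301
sim <- data.frame(
  hs_zbmi_who       = rnorm(n),
  hs_bpa_cadj_Log2  = rnorm(n),
  hs_child_age_None = runif(n, 6, 11),
  e3_sex_None       = factor(sample(c("female", "male"), n, replace = TRUE)),
  h_cohort          = factor(sample(1:6, n, replace = TRUE))
)
fit <- lm(hs_zbmi_who ~ hs_bpa_cadj_Log2 + hs_child_age_None +
            e3_sex_None + h_cohort, data = sim)
# effect estimate, SE, t and p for the exposure of interest
summary(fit)$coefficients["hs_bpa_cadj_Log2", ]
```

On the real data, the simulated `sim` would be replaced by the merged exposome, covariate, and phenotype tables.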

load("/Users/allison/Library/CloudStorage/GoogleDrive-aflouie@usc.edu/My Drive/HELIX_data/HELIX.RData")

# postnatal chemical and lifestyle exposures (excluding allergens)
filtered_chem_diet <- codebook %>%
  filter(domain %in% c("Chemicals", "Lifestyles") & period == "Postnatal" & subfamily != "Allergens")

# specific covariates
filtered_covariates <- codebook %>%
  filter(domain == "Covariates" & 
         variable_name %in% c("ID", "e3_sex_None", "e3_yearbir_None", "h_cohort", "hs_child_age_None"))

# specific phenotype variables
filtered_phenotype <- codebook %>%
  filter(domain == "Phenotype" & 
         variable_name %in% c("hs_zbmi_who"))

# combining all necessary variables together
combined_codebook <- bind_rows(filtered_chem_diet, filtered_covariates, filtered_phenotype)
kable(combined_codebook, align = "c", format = "html") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = F)
variable_name domain family subfamily period location period_postnatal description var_type transformation labels labelsshort
h_bfdur_Ter h_bfdur_Ter Lifestyles Lifestyle Diet Postnatal NA NA Breastfeeding duration (weeks) factor Tertiles Breastfeeding Breastfeeding
hs_bakery_prod_Ter hs_bakery_prod_Ter Lifestyles Lifestyle Diet Postnatal NA NA Food group: bakery products (hs_cookies + hs_pastries) factor Tertiles Bakery prod BakeProd
hs_beverages_Ter hs_beverages_Ter Lifestyles Lifestyle Diet Postnatal NA NA Food group: beverages (hs_dietsoda+hs_soda) factor Tertiles Soda Soda
hs_break_cer_Ter hs_break_cer_Ter Lifestyles Lifestyle Diet Postnatal NA NA Food group: breakfast cereal (hs_sugarcer+hs_othcer) factor Tertiles BF cereals BFcereals
hs_caff_drink_Ter hs_caff_drink_Ter Lifestyles Lifestyle Diet Postnatal NA NA Drinks a caffeinated or ‘energy’ drink (eg coca-cola, diet-coke, redbull) factor Tertiles Caffeine Caffeine
hs_dairy_Ter hs_dairy_Ter Lifestyles Lifestyle Diet Postnatal NA NA Food group: dairy (hs_cheese + hs_milk + hs_yogurt+ hs_probiotic+ hs_desert) factor Tertiles Dairy Dairy
hs_fastfood_Ter hs_fastfood_Ter Lifestyles Lifestyle Diet Postnatal NA NA Visits a fast food restaurant/take away factor Tertiles Fastfood Fastfood
hs_KIDMED_None hs_KIDMED_None Lifestyles Lifestyle Diet Postnatal NA NA Sum of KIDMED indices, without index9 numeric None KIDMED KIDMED
hs_mvpa_prd_alt_None hs_mvpa_prd_alt_None Lifestyles Lifestyle Physical activity Postnatal NA NA Clean & Over-reporting of Moderate-to-Vigorous Physical Activity (min/day) numeric None PA PA
hs_org_food_Ter hs_org_food_Ter Lifestyles Lifestyle Diet Postnatal NA NA Eats organic food factor Tertiles Organicfood Organicfood
hs_proc_meat_Ter hs_proc_meat_Ter Lifestyles Lifestyle Diet Postnatal NA NA Food group: processed meat (hs_coldmeat+hs_ham) factor Tertiles Processed meat ProcMeat
hs_readymade_Ter hs_readymade_Ter Lifestyles Lifestyle Diet Postnatal NA NA Eats a ‘ready-made’ supermarket meal factor Tertiles Ready made food ReadyFood
hs_sd_wk_None hs_sd_wk_None Lifestyles Lifestyle Physical activity Postnatal NA NA sedentary behaviour (min/day) numeric None Sedentary Sedentary
hs_total_bread_Ter hs_total_bread_Ter Lifestyles Lifestyle Diet Postnatal NA NA Food group: bread (hs_darkbread+hs_whbread) factor Tertiles Bread Bread
hs_total_cereal_Ter hs_total_cereal_Ter Lifestyles Lifestyle Diet Postnatal NA NA Food group: cereal (hs_darkbread + hs_whbread + hs_rice_pasta + hs_sugarcer + hs_othcer + hs_rusks) factor Tertiles Cereals Cereals
hs_total_fish_Ter hs_total_fish_Ter Lifestyles Lifestyle Diet Postnatal NA NA Food group: fish and seafood (hs_canfish+hs_oilyfish+hs_whfish+hs_seafood) factor Tertiles Fish Fish
hs_total_fruits_Ter hs_total_fruits_Ter Lifestyles Lifestyle Diet Postnatal NA NA Food group: fruits (hs_canfruit+hs_dryfruit+hs_freshjuice+hs_fruits) factor Tertiles Fruits Fruits
hs_total_lipids_Ter hs_total_lipids_Ter Lifestyles Lifestyle Diet Postnatal NA NA Food group: Added fat factor Tertiles Diet fat Diet fat
hs_total_meat_Ter hs_total_meat_Ter Lifestyles Lifestyle Diet Postnatal NA NA Food group: meat (hs_coldmeat+hs_ham+hs_poultry+hs_redmeat) factor Tertiles Meat Meat
hs_total_potatoes_Ter hs_total_potatoes_Ter Lifestyles Lifestyle Diet Postnatal NA NA Food group: potatoes (hs_frenchfries+hs_potatoes) factor Tertiles Potatoes Potatoes
hs_total_sweets_Ter hs_total_sweets_Ter Lifestyles Lifestyle Diet Postnatal NA NA Food group: sweets (hs_choco + hs_sweets + hs_sugar) factor Tertiles Sweets Sweets
hs_total_veg_Ter hs_total_veg_Ter Lifestyles Lifestyle Diet Postnatal NA NA Food group: vegetables (hs_cookveg+hs_rawveg) factor Tertiles Vegetables Vegetables
hs_total_yog_Ter hs_total_yog_Ter Lifestyles Lifestyle Diet Postnatal NA NA Food group: yogurt (hs_yogurt+hs_probiotic) factor Tertiles Yogurt Yogurt
hs_dif_hours_total_None hs_dif_hours_total_None Lifestyles Lifestyle Sleep Postnatal NA NA Total hours of sleep (mean weekdays and night) numeric None Sleep Sleep
hs_as_c_Log2 hs_as_c_Log2 Chemicals Metals As Postnatal NA NA Arsenic (As) in child numeric Logarithm base 2 As As
hs_cd_c_Log2 hs_cd_c_Log2 Chemicals Metals Cd Postnatal NA NA Cadmium (Cd) in child numeric Logarithm base 2 Cd Cd
hs_co_c_Log2 hs_co_c_Log2 Chemicals Metals Co Postnatal NA NA Cobalt (Co) in child numeric Logarithm base 2 Co Co
hs_cs_c_Log2 hs_cs_c_Log2 Chemicals Metals Cs Postnatal NA NA Caesium (Cs) in child numeric Logarithm base 2 Cs Cs
hs_cu_c_Log2 hs_cu_c_Log2 Chemicals Metals Cu Postnatal NA NA Copper (Cu) in child numeric Logarithm base 2 Cu Cu
hs_hg_c_Log2 hs_hg_c_Log2 Chemicals Metals Hg Postnatal NA NA Mercury (Hg) in child numeric Logarithm base 2 Hg Hg
hs_mn_c_Log2 hs_mn_c_Log2 Chemicals Metals Mn Postnatal NA NA Manganese (Mn) in child numeric Logarithm base 2 Mn Mn
hs_mo_c_Log2 hs_mo_c_Log2 Chemicals Metals Mo Postnatal NA NA Molybdenum (Mo) in child numeric Logarithm base 2 Mo Mo
hs_pb_c_Log2 hs_pb_c_Log2 Chemicals Metals Pb Postnatal NA NA Lead (Pb) in child numeric Logarithm base 2 Pb Pb
hs_tl_cdich_None hs_tl_cdich_None Chemicals Metals Tl Postnatal NA NA Dichotomous variable of thallium (Tl) in child factor None Tl Tl
hs_dde_cadj_Log2 hs_dde_cadj_Log2 Chemicals Organochlorines DDE Postnatal NA NA Dichlorodiphenyldichloroethylene (DDE) in child adjusted for lipids numeric Logarithm base 2 DDE DDE
hs_ddt_cadj_Log2 hs_ddt_cadj_Log2 Chemicals Organochlorines DDT Postnatal NA NA Dichlorodiphenyltrichloroethane (DDT) in child adjusted for lipids numeric Logarithm base 2 DDT DDT
hs_hcb_cadj_Log2 hs_hcb_cadj_Log2 Chemicals Organochlorines HCB Postnatal NA NA Hexachlorobenzene (HCB) in child adjusted for lipids numeric Logarithm base 2 HCB HCB
hs_pcb118_cadj_Log2 hs_pcb118_cadj_Log2 Chemicals Organochlorines PCBs Postnatal NA NA Polychlorinated biphenyl -118 (PCB-118) in child adjusted for lipids numeric Logarithm base 2 PCB 118 PCB118
hs_pcb138_cadj_Log2 hs_pcb138_cadj_Log2 Chemicals Organochlorines PCBs Postnatal NA NA Polychlorinated biphenyl-138 (PCB-138) in child adjusted for lipids numeric Logarithm base 2 PCB 138 PCB138
hs_pcb153_cadj_Log2 hs_pcb153_cadj_Log2 Chemicals Organochlorines PCBs Postnatal NA NA Polychlorinated biphenyl-153 (PCB-153) in child adjusted for lipids numeric Logarithm base 2 PCB 153 PCB153
hs_pcb170_cadj_Log2 hs_pcb170_cadj_Log2 Chemicals Organochlorines PCBs Postnatal NA NA Polychlorinated biphenyl-170 (PCB-170) in child adjusted for lipids numeric Logarithm base 2 PCB 170 PCB170
hs_pcb180_cadj_Log2 hs_pcb180_cadj_Log2 Chemicals Organochlorines PCBs Postnatal NA NA Polychlorinated biphenyl-180 (PCB-180) in child adjusted for lipids numeric Logarithm base 2 PCB 180 PCB180
hs_sumPCBs5_cadj_Log2 hs_sumPCBs5_cadj_Log2 Chemicals Organochlorines PCBs Postnatal NA NA Sum of PCBs in child adjusted for lipids (4 cohorts) numeric Logarithm base 2 PCBs SumPCB
hs_dep_cadj_Log2 hs_dep_cadj_Log2 Chemicals Organophosphate pesticides DEP Postnatal NA NA Diethyl phosphate (DEP) in child adjusted for creatinine numeric Logarithm base 2 DEP DEP
hs_detp_cadj_Log2 hs_detp_cadj_Log2 Chemicals Organophosphate pesticides DETP Postnatal NA NA Diethyl thiophosphate (DETP) in child adjusted for creatinine numeric Logarithm base 2 DETP DETP
hs_dmdtp_cdich_None hs_dmdtp_cdich_None Chemicals Organophosphate pesticides DMDTP Postnatal NA NA Dichotomous variable of dimethyl dithiophosphate (DMDTP) in child factor None DMDTP DMDTP
hs_dmp_cadj_Log2 hs_dmp_cadj_Log2 Chemicals Organophosphate pesticides DMP Postnatal NA NA Dimethyl phosphate (DMP) in child adjusted for creatinine numeric Logarithm base 2 DMP DMP
hs_dmtp_cadj_Log2 hs_dmtp_cadj_Log2 Chemicals Organophosphate pesticides DMTP Postnatal NA NA Dimethyl thiophosphate (DMTP) in child adjusted for creatinine numeric Logarithm base 2 DMDTP DMTP
hs_pbde153_cadj_Log2 hs_pbde153_cadj_Log2 Chemicals Polybrominated diphenyl ethers (PBDE) PBDE153 Postnatal NA NA Polybrominated diphenyl ether-153 (PBDE-153) in child adjusted for lipids numeric Logarithm base 2 PBDE 153 PBDE153
hs_pbde47_cadj_Log2 hs_pbde47_cadj_Log2 Chemicals Polybrominated diphenyl ethers (PBDE) PBDE47 Postnatal NA NA Polybrominated diphenyl ether-47 (PBDE-47) in child adjusted for lipids numeric Logarithm base 2 PBDE 47 PBDE47
hs_pfhxs_c_Log2 hs_pfhxs_c_Log2 Chemicals Per- and polyfluoroalkyl substances (PFAS) PFHXS Postnatal NA NA Perfluorohexane sulfonate (PFHXS) in child numeric Logarithm base 2 PFHXS PFHXS
hs_pfna_c_Log2 hs_pfna_c_Log2 Chemicals Per- and polyfluoroalkyl substances (PFAS) PFNA Postnatal NA NA Perfluorononanoate (PFNA) in child numeric Logarithm base 2 PFNA PFNA
hs_pfoa_c_Log2 hs_pfoa_c_Log2 Chemicals Per- and polyfluoroalkyl substances (PFAS) PFOA Postnatal NA NA Perfluorooctanoate (PFOA) in child numeric Logarithm base 2 PFOA PFOA
hs_pfos_c_Log2 hs_pfos_c_Log2 Chemicals Per- and polyfluoroalkyl substances (PFAS) PFOS Postnatal NA NA Perfluorooctane sulfonate (PFOS) in child numeric Logarithm base 2 PFOS PFOS
hs_pfunda_c_Log2 hs_pfunda_c_Log2 Chemicals Per- and polyfluoroalkyl substances (PFAS) PFUNDA Postnatal NA NA Perfluoroundecanoate (PFUNDA) in child numeric Logarithm base 2 PFUNDA PFUNDA
hs_bpa_cadj_Log2 hs_bpa_cadj_Log2 Chemicals Phenols BPA Postnatal NA NA Bisphenol A (BPA) in child adjusted for creatinine numeric Logarithm base 2 BPA BPA
hs_bupa_cadj_Log2 hs_bupa_cadj_Log2 Chemicals Phenols BUPA Postnatal NA NA N-Butyl paraben (BUPA) in child adjusted for creatinine numeric Logarithm base 2 BUPA BUPA
hs_etpa_cadj_Log2 hs_etpa_cadj_Log2 Chemicals Phenols ETPA Postnatal NA NA Ethyl paraben (ETPA) in child adjusted for creatinine numeric Logarithm base 2 ETPA ETPA
hs_mepa_cadj_Log2 hs_mepa_cadj_Log2 Chemicals Phenols MEPA Postnatal NA NA Methyl paraben (MEPA) in child adjusted for creatinine numeric Logarithm base 2 MEPA MEPA
hs_oxbe_cadj_Log2 hs_oxbe_cadj_Log2 Chemicals Phenols OXBE Postnatal NA NA Oxybenzone (OXBE) in child adjusted for creatinine numeric Logarithm base 2 OXBE OXBE
hs_prpa_cadj_Log2 hs_prpa_cadj_Log2 Chemicals Phenols PRPA Postnatal NA NA Propyl paraben (PRPA) in child adjusted for creatinine numeric Logarithm base 2 PRPA PRPA
hs_trcs_cadj_Log2 hs_trcs_cadj_Log2 Chemicals Phenols TRCS Postnatal NA NA Triclosan (TRCS) in child adjusted for creatinine numeric Logarithm base 2 TRCS TRCS
hs_mbzp_cadj_Log2 hs_mbzp_cadj_Log2 Chemicals Phthalates MBZP Postnatal NA NA Mono benzyl phthalate (MBzP) in child adjusted for creatinine numeric Logarithm base 2 MBZP MBZP
hs_mecpp_cadj_Log2 hs_mecpp_cadj_Log2 Chemicals Phthalates MECPP Postnatal NA NA Mono-2-ethyl 5-carboxypentyl phthalate (MECPP) in child adjusted for creatinine numeric Logarithm base 2 MECPP MECPP
hs_mehhp_cadj_Log2 hs_mehhp_cadj_Log2 Chemicals Phthalates MEHHP Postnatal NA NA Mono-2-ethyl-5-hydroxyhexyl phthalate (MEHHP) in child adjusted for creatinine numeric Logarithm base 2 MEHHP MEHHP
hs_mehp_cadj_Log2 hs_mehp_cadj_Log2 Chemicals Phthalates MEHP Postnatal NA NA Mono-2-ethylhexyl phthalate (MEHP) in child adjusted for creatinine numeric Logarithm base 2 MEHP MEHP
hs_meohp_cadj_Log2 hs_meohp_cadj_Log2 Chemicals Phthalates MEOHP Postnatal NA NA Mono-2-ethyl-5-oxohexyl phthalate (MEOHP) in child adjusted for creatinine numeric Logarithm base 2 MEOHP MEOHP
hs_mep_cadj_Log2 hs_mep_cadj_Log2 Chemicals Phthalates MEP Postnatal NA NA Monoethyl phthalate (MEP) in child adjusted for creatinine numeric Logarithm base 2 MEP MEP
hs_mibp_cadj_Log2 hs_mibp_cadj_Log2 Chemicals Phthalates MIBP Postnatal NA NA Mono-iso-butyl phthalate (MiBP) in child adjusted for creatinine numeric Logarithm base 2 MIBP MIBP
hs_mnbp_cadj_Log2 hs_mnbp_cadj_Log2 Chemicals Phthalates MNBP Postnatal NA NA Mono-n-butyl phthalate (MnBP) in child adjusted for creatinine numeric Logarithm base 2 MNBP MNBP
hs_ohminp_cadj_Log2 hs_ohminp_cadj_Log2 Chemicals Phthalates OHMiNP Postnatal NA NA Mono-4-methyl-7-hydroxyoctyl phthalate (OHMiNP) in child adjusted for creatinine numeric Logarithm base 2 OHMiNP OHMiNP
hs_oxominp_cadj_Log2 hs_oxominp_cadj_Log2 Chemicals Phthalates OXOMINP Postnatal NA NA Mono-4-methyl-7-oxooctyl phthalate (OXOMiNP) in child adjusted for creatinine numeric Logarithm base 2 OXOMINP OXOMINP
hs_sumDEHP_cadj_Log2 hs_sumDEHP_cadj_Log2 Chemicals Phthalates DEHP Postnatal NA NA Sum of DEHP metabolites (µg/g) in child adjusted for creatinine numeric Logarithm base 2 DEHP SumDEHP
FAS_cat_None FAS_cat_None Chemicals Social and economic capital Economic capital Postnatal NA NA Family affluence score factor None Family affluence FamAfl
hs_contactfam_3cat_num_None hs_contactfam_3cat_num_None Chemicals Social and economic capital Social capital Postnatal NA NA social capital: family friends factor None Social contact SocCont
hs_hm_pers_None hs_hm_pers_None Chemicals Social and economic capital Social capital Postnatal NA NA How many people live in your home? numeric None House crowding HouseCrow
hs_participation_3cat_None hs_participation_3cat_None Chemicals Social and economic capital Social capital Postnatal NA NA social capital: structural factor None Social participation SocPartic
hs_cotinine_cdich_None hs_cotinine_cdich_None Chemicals Tobacco Smoke Cotinine Postnatal NA NA Dichotomous variable of cotinine in child factor None Cotinine Cotinine
hs_globalexp2_None hs_globalexp2_None Chemicals Tobacco Smoke Tobacco Smoke Postnatal NA NA Global exposure of the child to ETS (2 categories) factor None ETS ETS
hs_smk_parents_None hs_smk_parents_None Chemicals Tobacco Smoke Tobacco Smoke Postnatal NA NA Tobacco Smoke status of parents (both) factor None Smoking_parents SmokPar
e3_sex_None e3_sex_None Covariates Covariates Child covariate Pregnancy NA NA Child sex (female / male) factor None Child sex Sex
e3_yearbir_None e3_yearbir_None Covariates Covariates Child covariate Pregnancy NA NA Year of birth (2003 to 2009) factor None Year of birth YearBirth
h_cohort h_cohort Covariates Covariates Maternal covariate Pregnancy NA NA Cohort of inclusion (1 to 6) factor None Cohort Cohort
hs_child_age_None hs_child_age_None Covariates Covariates Child covariate Postnatal NA NA Child age at examination (years) numeric None Child age cAge
hs_zbmi_who hs_zbmi_who Phenotype Phenotype Outcome at 6-11 years old Postnatal NA NA Body mass index z-score at 6-11 years old - WHO reference - Standardized on sex and age numeric None Body mass index z-score zBMI

Data Summary for Exposures, Covariates, and Outcome

Data Summary Exposures: Lifestyles

# specific lifestyle exposures
lifestyle_exposures <- c(
  "h_bfdur_Ter",
  "hs_bakery_prod_Ter",
  "hs_break_cer_Ter",
  "hs_dairy_Ter",
  "hs_fastfood_Ter",
  "hs_org_food_Ter",
  "hs_proc_meat_Ter",
  "hs_total_fish_Ter",
  "hs_total_fruits_Ter",
  "hs_total_lipids_Ter",
  "hs_total_sweets_Ter",
  "hs_total_veg_Ter"
)

lifestyle_exposome <- dplyr::select(exposome, all_of(lifestyle_exposures))
summarytools::view(dfSummary(lifestyle_exposome, style = 'grid', plain.ascii = FALSE, valid.col = FALSE, headings = FALSE), method = "render")
No Variable Stats / Values Freqs (% of Valid) Missing
1 h_bfdur_Ter [factor]
1. (0,10.8]
2. (10.8,34.9]
3. (34.9,Inf]
506(38.9%)
270(20.8%)
525(40.4%)
0 (0.0%)
2 hs_bakery_prod_Ter [factor]
1. (0,2]
2. (2,6]
3. (6,Inf]
345(26.5%)
423(32.5%)
533(41.0%)
0 (0.0%)
3 hs_break_cer_Ter [factor]
1. (0,1.1]
2. (1.1,5.5]
3. (5.5,Inf]
291(22.4%)
521(40.0%)
489(37.6%)
0 (0.0%)
4 hs_dairy_Ter [factor]
1. (0,14.6]
2. (14.6,25.6]
3. (25.6,Inf]
359(27.6%)
465(35.7%)
477(36.7%)
0 (0.0%)
5 hs_fastfood_Ter [factor]
1. (0,0.132]
2. (0.132,0.5]
3. (0.5,Inf]
143(11.0%)
603(46.3%)
555(42.7%)
0 (0.0%)
6 hs_org_food_Ter [factor]
1. (0,0.132]
2. (0.132,1]
3. (1,Inf]
429(33.0%)
396(30.4%)
476(36.6%)
0 (0.0%)
7 hs_proc_meat_Ter [factor]
1. (0,1.5]
2. (1.5,4]
3. (4,Inf]
366(28.1%)
471(36.2%)
464(35.7%)
0 (0.0%)
8 hs_total_fish_Ter [factor]
1. (0,1.5]
2. (1.5,3]
3. (3,Inf]
389(29.9%)
454(34.9%)
458(35.2%)
0 (0.0%)
9 hs_total_fruits_Ter [factor]
1. (0,7]
2. (7,14.1]
3. (14.1,Inf]
413(31.7%)
407(31.3%)
481(37.0%)
0 (0.0%)
10 hs_total_lipids_Ter [factor]
1. (0,3]
2. (3,7]
3. (7,Inf]
397(30.5%)
403(31.0%)
501(38.5%)
0 (0.0%)
11 hs_total_sweets_Ter [factor]
1. (0,4.1]
2. (4.1,8.5]
3. (8.5,Inf]
344(26.4%)
516(39.7%)
441(33.9%)
0 (0.0%)
12 hs_total_veg_Ter [factor]
1. (0,6]
2. (6,8.5]
3. (8.5,Inf]
404(31.1%)
314(24.1%)
583(44.8%)
0 (0.0%)

Generated by summarytools 1.0.1 (R version 4.4.0)
2024-07-21

categorical_lifestyle <- lifestyle_exposome %>% 
  dplyr::select(where(is.factor))

categorical_lifestyle_long <- pivot_longer(
  categorical_lifestyle,
  cols = everything(),
  names_to = "variable",
  values_to = "value"
)

unique_categorical_vars <- unique(categorical_lifestyle_long$variable)
categorical_plots <- lapply(unique_categorical_vars, function(var) {
  data <- filter(categorical_lifestyle_long, variable == var)
  
  p <- ggplot(data, aes(x = value, fill = value)) +
    geom_bar(stat = "count") +
    labs(title = paste("Distribution of", var), x = var, y = "Count")
  
  print(p)
  return(p)
})

Breastfeeding Duration: The highest duration tertile is the largest (40.4%), closely followed by the lowest (38.9%), suggesting that both long and short breastfeeding durations are common.

Bakery Products: Consumption leans toward the highest tertile (41.0%), with the lowest tertile least represented (26.5%).

Breakfast Cereal: The middle tertile of cereal consumption is the most common (40.0%), with the lowest tertile least represented (22.4%).

Dairy: Shows a fairly even distribution across all categories, indicating a uniform consumption pattern of dairy products.

Fast Food: The middle tertile is the largest (46.3%), indicating moderate fast food consumption for most participants; only 11.0% fall in the lowest tertile.

Organic Food: The highest tertile is the largest (36.6%), though all three tertiles are fairly close in size.

Processed Meat: Consumption levels are fairly evenly distributed, indicating varied dietary habits regarding processed meats.

Fish and Seafood: Even distribution across categories, indicating varied consumption of fish and seafood.

Fruits: The highest fruit tertile is the largest (37.0%); the lower two tertiles are nearly equal in size.

Added Fats: The highest tertile is the largest (38.5%), with the lower two tertiles roughly equal at about 31% each.

Sweets: The middle tertile of sweets consumption is the most common (39.7%), followed by the highest (33.9%).

Vegetables: The highest vegetable tertile is the largest (44.8%), indicating that high vegetable consumption is common.
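For reference, the tertile percentages quoted above are simple frequency proportions. A minimal sketch, shown on a simulated stand-in for `exposome$h_bfdur_Ter` so it runs on its own:

```r
# Sketch: percentage breakdown of a tertile factor, as quoted above.
# Simulated stand-in for exposome$h_bfdur_Ter (same levels, similar mix).
set.seed(7)
bf_ter <- factor(sample(c("(0,10.8]", "(10.8,34.9]", "(34.9,Inf]"),
                        size = 1301, replace = TRUE,
                        prob = c(0.389, 0.208, 0.404)))
round(100 * prop.table(table(bf_ter)), 1)
```

On the real data, `table()` plus `prop.table()` reproduces the "Freqs (% of Valid)" column of the summary above.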

Data Summary Exposures: Chemicals

# specific chemical exposures
chemical_exposures <- c(
  "hs_cd_c_Log2",
  "hs_co_c_Log2",
  "hs_cs_c_Log2",
  "hs_cu_c_Log2",
  "hs_hg_c_Log2",
  "hs_mo_c_Log2",
  "hs_pb_c_Log2",
  "hs_dde_cadj_Log2",
  "hs_pcb153_cadj_Log2",
  "hs_pcb170_cadj_Log2",
  "hs_dep_cadj_Log2",
  "hs_pbde153_cadj_Log2",
  "hs_pfhxs_c_Log2",
  "hs_pfoa_c_Log2",
  "hs_pfos_c_Log2",
  "hs_prpa_cadj_Log2",
  "hs_mbzp_cadj_Log2",
  "hs_mibp_cadj_Log2",
  "hs_mnbp_cadj_Log2"
)

chemical_exposome <- dplyr::select(exposome, all_of(chemical_exposures))
summarytools::view(dfSummary(chemical_exposome, style = 'grid', plain.ascii = FALSE, valid.col = FALSE, headings = FALSE), method = "render")
No Variable Stats / Values Freqs (% of Valid) Missing
1 hs_cd_c_Log2 [numeric]
Mean (sd) : -4 (1)
min ≤ med ≤ max:
-10.4 ≤ -3.8 ≤ 0.8
IQR (CV) : 1 (-0.3)
695 distinct values 0 (0.0%)
2 hs_co_c_Log2 [numeric]
Mean (sd) : -2.3 (0.6)
min ≤ med ≤ max:
-5.5 ≤ -2.4 ≤ 1.4
IQR (CV) : 0.7 (-0.3)
317 distinct values 0 (0.0%)
3 hs_cs_c_Log2 [numeric]
Mean (sd) : 0.4 (0.6)
min ≤ med ≤ max:
-1.5 ≤ 0.5 ≤ 3.1
IQR (CV) : 0.8 (1.3)
369 distinct values 0 (0.0%)
4 hs_cu_c_Log2 [numeric]
Mean (sd) : 9.8 (0.2)
min ≤ med ≤ max:
9.1 ≤ 9.8 ≤ 12.1
IQR (CV) : 0.3 (0)
345 distinct values 0 (0.0%)
5 hs_hg_c_Log2 [numeric]
Mean (sd) : -0.3 (1.7)
min ≤ med ≤ max:
-10.9 ≤ -0.2 ≤ 3.7
IQR (CV) : 2.1 (-5.6)
698 distinct values 0 (0.0%)
6 hs_mo_c_Log2 [numeric]
Mean (sd) : -0.3 (0.9)
min ≤ med ≤ max:
-9.2 ≤ -0.4 ≤ 5.1
IQR (CV) : 0.8 (-2.9)
593 distinct values 0 (0.0%)
7 hs_pb_c_Log2 [numeric]
Mean (sd) : 3.1 (0.6)
min ≤ med ≤ max:
1.1 ≤ 3.1 ≤ 7.7
IQR (CV) : 0.8 (0.2)
529 distinct values 0 (0.0%)
8 hs_dde_cadj_Log2 [numeric]
Mean (sd) : 4.7 (1.5)
min ≤ med ≤ max:
1.2 ≤ 4.5 ≤ 11.1
IQR (CV) : 1.9 (0.3)
1050 distinct values 0 (0.0%)
9 hs_pcb153_cadj_Log2 [numeric]
Mean (sd) : 3.6 (0.9)
min ≤ med ≤ max:
1.2 ≤ 3.5 ≤ 7.8
IQR (CV) : 1.4 (0.3)
1047 distinct values 0 (0.0%)
10 hs_pcb170_cadj_Log2 [numeric]
Mean (sd) : -0.3 (3)
min ≤ med ≤ max:
-16.8 ≤ 0.3 ≤ 4.8
IQR (CV) : 2.2 (-9.8)
1039 distinct values 0 (0.0%)
11 hs_dep_cadj_Log2 [numeric]
Mean (sd) : 0.2 (3.2)
min ≤ med ≤ max:
-12.6 ≤ 0.9 ≤ 9.4
IQR (CV) : 3.3 (20)
1045 distinct values 0 (0.0%)
12 hs_pbde153_cadj_Log2 [numeric]
Mean (sd) : -4.5 (3.8)
min ≤ med ≤ max:
-17.6 ≤ -2.6 ≤ 4
IQR (CV) : 6.7 (-0.8)
1036 distinct values 0 (0.0%)
13 hs_pfhxs_c_Log2 [numeric]
Mean (sd) : -1.6 (1.3)
min ≤ med ≤ max:
-8.9 ≤ -1.4 ≤ 4.8
IQR (CV) : 1.7 (-0.8)
1061 distinct values 0 (0.0%)
14 hs_pfoa_c_Log2 [numeric]
Mean (sd) : 0.6 (0.6)
min ≤ med ≤ max:
-2.2 ≤ 0.6 ≤ 2.7
IQR (CV) : 0.7 (0.9)
1061 distinct values 0 (0.0%)
15 hs_pfos_c_Log2 [numeric]
Mean (sd) : 1 (1.1)
min ≤ med ≤ max:
-10.4 ≤ 1 ≤ 5.1
IQR (CV) : 1.3 (1.1)
1050 distinct values 0 (0.0%)
16 hs_prpa_cadj_Log2 [numeric]
Mean (sd) : -1.6 (3.8)
min ≤ med ≤ max:
-12 ≤ -2.3 ≤ 10.8
IQR (CV) : 5.2 (-2.4)
1031 distinct values 0 (0.0%)
17 hs_mbzp_cadj_Log2 [numeric]
Mean (sd) : 2.4 (1.2)
min ≤ med ≤ max:
-0.6 ≤ 2.3 ≤ 7.2
IQR (CV) : 1.5 (0.5)
1046 distinct values 0 (0.0%)
18 hs_mibp_cadj_Log2 [numeric]
Mean (sd) : 5.5 (1.1)
min ≤ med ≤ max:
2.3 ≤ 5.4 ≤ 9.8
IQR (CV) : 1.5 (0.2)
1057 distinct values 0 (0.0%)
19 hs_mnbp_cadj_Log2 [numeric]
Mean (sd) : 4.7 (1)
min ≤ med ≤ max:
1.9 ≤ 4.6 ≤ 8.9
IQR (CV) : 1.3 (0.2)
1048 distinct values 0 (0.0%)


#separate numeric and categorical data
numeric_chemical <- chemical_exposome %>% 
  dplyr::select(where(is.numeric))

numeric_chemical_long <- pivot_longer(
  numeric_chemical,
  cols = everything(),
  names_to = "variable",
  values_to = "value"
)

unique_numerical_vars <- unique(numeric_chemical_long$variable)

num_plots <- lapply(unique_numerical_vars, function(var) {
  data <- filter(numeric_chemical_long, variable == var)
  p <- ggplot(data, aes(x = value)) +
    geom_histogram(bins = 30, fill = "blue") +
    labs(title = paste("Histogram of", var), x = "Value", y = "Count")
  print(p)
  return(p)
})

Cadmium (hs_cd_c_Log2): The histogram for cadmium exposure shows a relatively symmetric distribution centered around -4 on the log2 scale. Most values fall between approximately -5 and -3, with a few low outliers (minimum -10.4).

Cobalt (hs_co_c_Log2): The histogram of cobalt levels displays a roughly normal distribution centered around -2.3 with a slight skew.

Cesium (hs_cs_c_Log2): Exhibits a right-skewed distribution: most participants cluster around the median of 0.5, while a small number have substantially higher levels (up to 3.1).

Copper (hs_cu_c_Log2): Shows a right-skewed distribution centered tightly around 9.8 (SD 0.2); most individuals have similar levels, while a few reach substantially higher values (up to 12.1).

Mercury (hs_hg_c_Log2): On the log2 scale this distribution is left-skewed, with most values near the median of -0.2 and a long tail of very low concentrations (minimum -10.9).

Molybdenum (hs_mo_c_Log2): Shows a sharp peak near the median of -0.4 with a narrow core (IQR 0.8) and long tails in both directions (minimum -9.2, maximum 5.1): most people have similar exposure levels, while a few have much lower or higher exposures.

Lead (hs_pb_c_Log2): The distribution is slightly right-skewed, indicating higher exposure levels in a smaller group of the population compared to the majority.

DDE (hs_dde_cadj_Log2): Shows a pronounced right skew, typical for chemicals that accumulate in the environment and in human tissues, indicating higher levels of exposure in a smaller subset of the population.

PCB 153 (hs_pcb153_cadj_Log2): The distribution is right-skewed and appears bimodal, with peaks near 2 and 4, suggesting that higher exposure is concentrated in a smaller segment of the population.

PCB 170 (hs_pcb170_cadj_Log2): This histogram shows a marked left skew on the log2 scale: most samples cluster near the median of 0.3, with a long tail of very low concentrations (minimum -16.8).

DEP (hs_dep_cadj_Log2): DEP exposure is centered near the median of 0.9 but spreads widely (SD 3.2, IQR 3.3), indicating substantial variability across participants.

PBDE 153 (hs_pbde153_cadj_Log2): This histogram shows a bimodal distribution with widely separated modes, consistent with its large IQR (6.7).

PFHxS: The distribution peaks around the median of -1.4 with a broad spread, indicating variation in PFHxS levels.

PFOA: Shows a roughly symmetric distribution peaking around 0.6, indicating relatively consistent PFOA levels (SD 0.6).

PFOS: Similar to PFOA, the distribution peaks around 1, though with a longer lower tail (minimum -10.4).

Propyl Paraben (hs_prpa_cadj_Log2): The distribution peaks near the median of -2.3 with a very broad spread (SD 3.8), indicating large variability in propyl paraben levels.

Monobenzyl Phthalate (hs_mbzp_cadj_Log2): This histogram shows a right-skewed distribution: most values cluster near the median of 2.3, with a long tail toward higher values suggesting occasional higher exposures.

Monoisobutyl Phthalate (hs_mibp_cadj_Log2): The distribution is right-skewed, similar to MBZP, but with a smoother decline. This pattern also indicates that while most subjects have lower exposure levels, a few experience significantly higher exposures.

Mono-n-butyl Phthalate (hs_mnbp_cadj_Log2): Peaks around the median of 4.6, indicating fairly consistent exposure levels with some variation and few outliers.
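The skew directions described above can be checked numerically with the sample skewness (the third standardized moment: positive means right-skewed, negative means left-skewed). A minimal helper, illustrated on simulated data:

```r
# Sample skewness: positive = right skew, negative = left skew.
skewness <- function(x) {
  x <- x[!is.na(x)]
  z <- (x - mean(x)) / sd(x)
  mean(z^3)
}

# On the HELIX data this would be sapply(numeric_chemical, skewness);
# illustrated here on simulated right-skewed (exponential) values.
set.seed(42)
skewness(rexp(1000))  # positive, confirming right skew
```

Packages such as moments provide the same statistic, but the one-liner above avoids an extra dependency.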

# correlation matrix of the numeric chemical exposures (select_if is deprecated)
numeric_chemical <- dplyr::select(chemical_exposome, where(is.numeric))
cor_matrix <- cor(numeric_chemical, use = "complete.obs")
corrplot(cor_matrix, method = "color", type = "upper", tl.col = "black", tl.srt = 90, tl.cex = 0.6)
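As a numeric complement to the corrplot, the strongest pairwise correlations can be pulled directly out of a correlation matrix. A sketch on simulated data (on the HELIX data, `cor_matrix` from the chunk above would replace `cm`):

```r
# Sketch: list the strongest pairwise correlations in a correlation matrix,
# complementing the corrplot visual. Illustrated on simulated data.
set.seed(1)
sim <- as.data.frame(matrix(rnorm(500), ncol = 5,
                            dimnames = list(NULL, paste0("chem", 1:5))))
cm <- cor(sim, use = "complete.obs")

# indices of the upper triangle (each pair once, no diagonal)
pairs <- which(upper.tri(cm), arr.ind = TRUE)
top <- data.frame(var1 = rownames(cm)[pairs[, 1]],
                  var2 = colnames(cm)[pairs[, 2]],
                  r    = cm[pairs])
# three most strongly correlated pairs by absolute value
head(top[order(-abs(top$r)), ], 3)
```

This is useful for flagging highly collinear exposures (e.g. PCB congeners) before regression modeling.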

Data Summary Covariates

# Specified covariates
specific_covariates <- c(
  "e3_sex_None", 
  "e3_yearbir_None",
  "h_cohort", 
  "hs_child_age_None"
)

covariate_data <- dplyr::select(covariates, all_of(specific_covariates))
summarytools::view(dfSummary(covariate_data, style = 'grid', plain.ascii = FALSE, valid.col = FALSE, headings = FALSE), method = "render")
No   Variable                       Stats / Values                       Freqs (% of Valid)    Missing
1    e3_sex_None [factor]           1. female                            608 (46.7%)           0 (0.0%)
                                    2. male                              693 (53.3%)
2    e3_yearbir_None [factor]       1. 2003                              55 (4.2%)             0 (0.0%)
                                    2. 2004                              107 (8.2%)
                                    3. 2005                              241 (18.5%)
                                    4. 2006                              256 (19.7%)
                                    5. 2007                              250 (19.2%)
                                    6. 2008                              379 (29.1%)
                                    7. 2009                              13 (1.0%)
3    h_cohort [factor]              1. 1                                 202 (15.5%)           0 (0.0%)
                                    2. 2                                 198 (15.2%)
                                    3. 3                                 224 (17.2%)
                                    4. 4                                 207 (15.9%)
                                    5. 5                                 272 (20.9%)
                                    6. 6                                 198 (15.2%)
4    hs_child_age_None [numeric]    Mean (sd): 8 (1.6)                   879 distinct values   0 (0.0%)
                                    min ≤ med ≤ max: 5.4 ≤ 8 ≤ 12.1
                                    IQR (CV): 2.4 (0.2)

Generated by summarytools 1.0.1 (R version 4.4.0)
2024-07-21

#separate numeric and categorical data
numeric_covariates <- covariate_data %>% 
  dplyr::select(where(is.numeric))

numeric_covariates_long <- pivot_longer(
  numeric_covariates,
  cols = everything(),
  names_to = "variable",
  values_to = "value"
)

unique_numerical_vars <- unique(numeric_covariates_long$variable)

num_plots <- lapply(unique_numerical_vars, function(var) {
  data <- filter(numeric_covariates_long, variable == var)
  p <- ggplot(data, aes(x = value)) +
    geom_histogram(bins = 30, fill = "blue") +
    labs(title = paste("Histogram of", var), x = "Value", y = "Count")
  print(p)
  return(p)
})

Child’s Age (hs_child_age_None): This histogram is multimodal, with several peaks across different ages, likely reflecting cohort-specific data collection points or particular age groups being studied.

categorical_covariates <- covariate_data %>% 
  dplyr::select(where(is.factor))

categorical_covariates_long <- pivot_longer(
  categorical_covariates,
  cols = everything(),
  names_to = "variable",
  values_to = "value"
)

unique_categorical_vars <- unique(categorical_covariates_long$variable)
categorical_plots <- lapply(unique_categorical_vars, function(var) {
  data <- filter(categorical_covariates_long, variable == var)
  
  p <- ggplot(data, aes(x = value, fill = value)) +
    geom_bar(stat = "count") +
    labs(title = paste("Distribution of", var), x = var, y = "Count")
  
  print(p)
  return(p)
})

Cohorts (h_cohort): The distribution shows the count of subjects across six different cohorts. All cohorts have a substantial number of subjects, with cohort 5 showing the highest participation.

Gender Distribution (e3_sex): The gender distribution is nearly balanced, with a slightly higher count for males (53.3%) than females (46.7%).

Year of Birth (e3_yearbir): Most subjects were born between 2005 and 2008, with the largest share in 2008 (29.1%). Only 13 subjects (1.0%) were born in 2009, suggesting recruitment largely concluded before that year.

Data Summary Outcome: Phenotype

outcome_BMI <- phenotype %>% 
  dplyr::select(hs_zbmi_who)
summarytools::view(dfSummary(outcome_BMI, style = 'grid', plain.ascii = FALSE, valid.col = FALSE, headings = FALSE), method = "render")
No   Variable                  Stats / Values                       Freqs (% of Valid)    Missing
1    hs_zbmi_who [numeric]     Mean (sd): 0.4 (1.2)                 421 distinct values   0 (0.0%)
                               min ≤ med ≤ max: -3.6 ≤ 0.3 ≤ 4.7
                               IQR (CV): 1.5 (3)


Descriptive Tables

# Combine all selected data
combined_data <- cbind(covariate_data, lifestyle_exposome, chemical_exposome, outcome_BMI)

# Ensure no duplicated columns
combined_data <- combined_data[, !duplicated(colnames(combined_data))]

# Convert sex variable to a factor for stratification
combined_data$e3_sex_None <- as.factor(combined_data$e3_sex_None)
# Relabel levels in their existing (alphabetical) order: "female", "male".
# Assigning c("Male", "Female") here would swap the labels.
levels(combined_data$e3_sex_None) <- c("Female", "Male")

render_cont <- function(x) {
  with(stats.default(x), sprintf("%0.2f (%0.2f)", MEAN, SD))
}

render_cat <- function(x) {
  c("", sapply(stats.default(x), function(y) with(y, sprintf("%d (%0.1f %%)", FREQ, PCT))))
}

# Define the formula for table1
table1_formula <- ~ 
  hs_child_age_None + e3_yearbir_None + h_cohort +
  hs_zbmi_who +
  h_bfdur_Ter + hs_bakery_prod_Ter + hs_break_cer_Ter + hs_dairy_Ter + hs_fastfood_Ter + hs_org_food_Ter +
  hs_proc_meat_Ter +
  hs_total_fish_Ter + hs_total_fruits_Ter + hs_total_lipids_Ter + hs_total_sweets_Ter + hs_total_veg_Ter +
  hs_cd_c_Log2 + hs_co_c_Log2 + hs_cs_c_Log2 + hs_cu_c_Log2 +
  hs_hg_c_Log2 + hs_mo_c_Log2 + hs_dde_cadj_Log2 + hs_pcb153_cadj_Log2 +
  hs_pcb170_cadj_Log2 + hs_dep_cadj_Log2 + hs_pbde153_cadj_Log2 +
  hs_pfhxs_c_Log2 + hs_pfoa_c_Log2 + hs_pfos_c_Log2 + hs_prpa_cadj_Log2 +
  hs_mbzp_cadj_Log2 + hs_mibp_cadj_Log2 + hs_mnbp_cadj_Log2 | e3_sex_None

# Create the table
table1(
  table1_formula,
  data = combined_data,
  render.continuous = render_cont,
  render.categorical = render_cat,
  overall = "Overall",  # a character label; overall = TRUE labels the column "TRUE"
  topclass = "Rtable1-shade"
)
                     Female (N=608)    Male (N=693)    Overall (N=1301)
hs_child_age_None 7.91 (1.58) 8.03 (1.64) 7.98 (1.61)
e3_yearbir_None
2003 25 (4.1 %) 30 (4.3 %) 55 (4.2 %)
2004 46 (7.6 %) 61 (8.8 %) 107 (8.2 %)
2005 121 (19.9 %) 120 (17.3 %) 241 (18.5 %)
2006 108 (17.8 %) 148 (21.4 %) 256 (19.7 %)
2007 128 (21.1 %) 122 (17.6 %) 250 (19.2 %)
2008 177 (29.1 %) 202 (29.1 %) 379 (29.1 %)
2009 3 (0.5 %) 10 (1.4 %) 13 (1.0 %)
h_cohort
1 97 (16.0 %) 105 (15.2 %) 202 (15.5 %)
2 86 (14.1 %) 112 (16.2 %) 198 (15.2 %)
3 102 (16.8 %) 122 (17.6 %) 224 (17.2 %)
4 93 (15.3 %) 114 (16.5 %) 207 (15.9 %)
5 129 (21.2 %) 143 (20.6 %) 272 (20.9 %)
6 101 (16.6 %) 97 (14.0 %) 198 (15.2 %)
hs_zbmi_who 0.35 (1.15) 0.45 (1.22) 0.40 (1.19)
h_bfdur_Ter
(0,10.8] 231 (38.0 %) 275 (39.7 %) 506 (38.9 %)
(10.8,34.9] 118 (19.4 %) 152 (21.9 %) 270 (20.8 %)
(34.9,Inf] 259 (42.6 %) 266 (38.4 %) 525 (40.4 %)
hs_bakery_prod_Ter
(0,2] 164 (27.0 %) 181 (26.1 %) 345 (26.5 %)
(2,6] 188 (30.9 %) 235 (33.9 %) 423 (32.5 %)
(6,Inf] 256 (42.1 %) 277 (40.0 %) 533 (41.0 %)
hs_break_cer_Ter
(0,1.1] 141 (23.2 %) 150 (21.6 %) 291 (22.4 %)
(1.1,5.5] 251 (41.3 %) 270 (39.0 %) 521 (40.0 %)
(5.5,Inf] 216 (35.5 %) 273 (39.4 %) 489 (37.6 %)
hs_dairy_Ter
(0,14.6] 175 (28.8 %) 184 (26.6 %) 359 (27.6 %)
(14.6,25.6] 229 (37.7 %) 236 (34.1 %) 465 (35.7 %)
(25.6,Inf] 204 (33.6 %) 273 (39.4 %) 477 (36.7 %)
hs_fastfood_Ter
(0,0.132] 75 (12.3 %) 68 (9.8 %) 143 (11.0 %)
(0.132,0.5] 273 (44.9 %) 330 (47.6 %) 603 (46.3 %)
(0.5,Inf] 260 (42.8 %) 295 (42.6 %) 555 (42.7 %)
hs_org_food_Ter
(0,0.132] 211 (34.7 %) 218 (31.5 %) 429 (33.0 %)
(0.132,1] 191 (31.4 %) 205 (29.6 %) 396 (30.4 %)
(1,Inf] 206 (33.9 %) 270 (39.0 %) 476 (36.6 %)
hs_proc_meat_Ter
(0,1.5] 175 (28.8 %) 191 (27.6 %) 366 (28.1 %)
(1.5,4] 227 (37.3 %) 244 (35.2 %) 471 (36.2 %)
(4,Inf] 206 (33.9 %) 258 (37.2 %) 464 (35.7 %)
hs_total_fish_Ter
(0,1.5] 183 (30.1 %) 206 (29.7 %) 389 (29.9 %)
(1.5,3] 224 (36.8 %) 230 (33.2 %) 454 (34.9 %)
(3,Inf] 201 (33.1 %) 257 (37.1 %) 458 (35.2 %)
hs_total_fruits_Ter
(0,7] 174 (28.6 %) 239 (34.5 %) 413 (31.7 %)
(7,14.1] 216 (35.5 %) 191 (27.6 %) 407 (31.3 %)
(14.1,Inf] 218 (35.9 %) 263 (38.0 %) 481 (37.0 %)
hs_total_lipids_Ter
(0,3] 193 (31.7 %) 204 (29.4 %) 397 (30.5 %)
(3,7] 171 (28.1 %) 232 (33.5 %) 403 (31.0 %)
(7,Inf] 244 (40.1 %) 257 (37.1 %) 501 (38.5 %)
hs_total_sweets_Ter
(0,4.1] 149 (24.5 %) 195 (28.1 %) 344 (26.4 %)
(4.1,8.5] 251 (41.3 %) 265 (38.2 %) 516 (39.7 %)
(8.5,Inf] 208 (34.2 %) 233 (33.6 %) 441 (33.9 %)
hs_total_veg_Ter
(0,6] 190 (31.2 %) 214 (30.9 %) 404 (31.1 %)
(6,8.5] 136 (22.4 %) 178 (25.7 %) 314 (24.1 %)
(8.5,Inf] 282 (46.4 %) 301 (43.4 %) 583 (44.8 %)
hs_cd_c_Log2 -3.99 (0.98) -3.95 (1.09) -3.97 (1.04)
hs_co_c_Log2 -2.37 (0.61) -2.32 (0.64) -2.34 (0.63)
hs_cs_c_Log2 0.44 (0.58) 0.44 (0.57) 0.44 (0.57)
hs_cu_c_Log2 9.81 (0.25) 9.84 (0.22) 9.83 (0.23)
hs_hg_c_Log2 -0.24 (1.59) -0.35 (1.75) -0.30 (1.68)
hs_mo_c_Log2 -0.32 (0.83) -0.31 (0.96) -0.32 (0.90)
hs_dde_cadj_Log2 4.63 (1.48) 4.70 (1.50) 4.67 (1.49)
hs_pcb153_cadj_Log2 3.47 (0.86) 3.63 (0.94) 3.56 (0.90)
hs_pcb170_cadj_Log2 -0.60 (3.22) -0.05 (2.77) -0.31 (3.00)
hs_dep_cadj_Log2 0.27 (3.16) 0.06 (3.25) 0.16 (3.21)
hs_pbde153_cadj_Log2 -4.66 (3.86) -4.40 (3.80) -4.53 (3.83)
hs_pfhxs_c_Log2 -1.62 (1.30) -1.53 (1.31) -1.57 (1.31)
hs_pfoa_c_Log2 0.60 (0.55) 0.62 (0.56) 0.61 (0.55)
hs_pfos_c_Log2 0.95 (1.15) 0.99 (1.08) 0.97 (1.11)
hs_prpa_cadj_Log2 -1.26 (3.96) -1.91 (3.68) -1.61 (3.82)
hs_mbzp_cadj_Log2 2.42 (1.23) 2.47 (1.22) 2.44 (1.22)
hs_mibp_cadj_Log2 5.54 (1.09) 5.39 (1.12) 5.46 (1.11)
hs_mnbp_cadj_Log2 4.77 (1.08) 4.60 (0.96) 4.68 (1.02)
combined_data$h_cohort <- as.factor(combined_data$h_cohort)
# Create the table
table1(
  ~ hs_child_age_None + e3_sex_None + e3_yearbir_None +
    hs_zbmi_who + h_bfdur_Ter + hs_bakery_prod_Ter +
    hs_break_cer_Ter + hs_dairy_Ter + hs_fastfood_Ter +
    hs_org_food_Ter + hs_proc_meat_Ter + hs_total_fish_Ter + hs_total_fruits_Ter +
    hs_total_lipids_Ter +
    hs_total_sweets_Ter + hs_total_veg_Ter +
    hs_cd_c_Log2 + hs_co_c_Log2 + hs_cs_c_Log2 + hs_cu_c_Log2 +
    hs_hg_c_Log2 + hs_mo_c_Log2 + hs_dde_cadj_Log2 + hs_pcb153_cadj_Log2 +
    hs_pcb170_cadj_Log2 + hs_dep_cadj_Log2 + hs_pbde153_cadj_Log2 +
    hs_pfhxs_c_Log2 + hs_pfoa_c_Log2 + hs_pfos_c_Log2 + hs_prpa_cadj_Log2 +
    hs_mbzp_cadj_Log2 + hs_mibp_cadj_Log2 + hs_mnbp_cadj_Log2 | h_cohort,
    data = combined_data,
  render.continuous = render_cont,
  render.categorical = render_cat,
  overall = "Overall",  # a character label; overall = TRUE labels the column "TRUE"
  topclass = "Rtable1-shade"
)
               1 (N=202)   2 (N=198)   3 (N=224)   4 (N=207)   5 (N=272)   6 (N=198)   Overall (N=1301)
hs_child_age_None 6.61 (0.28) 10.82 (0.58) 8.78 (0.58) 6.48 (0.47) 8.46 (0.53) 6.51 (0.30) 7.98 (1.61)
e3_sex_None
Female 97 (48.0 %) 86 (43.4 %) 102 (45.5 %) 93 (44.9 %) 129 (47.4 %) 101 (51.0 %) 608 (46.7 %)
Male 105 (52.0 %) 112 (56.6 %) 122 (54.5 %) 114 (55.1 %) 143 (52.6 %) 97 (49.0 %) 693 (53.3 %)
e3_yearbir_None
2003 0 (0.0 %) 55 (27.8 %) 0 (0.0 %) 0 (0.0 %) 0 (0.0 %) 0 (0.0 %) 55 (4.2 %)
2004 0 (0.0 %) 107 (54.0 %) 0 (0.0 %) 0 (0.0 %) 0 (0.0 %) 0 (0.0 %) 107 (8.2 %)
2005 0 (0.0 %) 36 (18.2 %) 120 (53.6 %) 0 (0.0 %) 85 (31.2 %) 0 (0.0 %) 241 (18.5 %)
2006 0 (0.0 %) 0 (0.0 %) 99 (44.2 %) 0 (0.0 %) 157 (57.7 %) 0 (0.0 %) 256 (19.7 %)
2007 82 (40.6 %) 0 (0.0 %) 5 (2.2 %) 62 (30.0 %) 30 (11.0 %) 71 (35.9 %) 250 (19.2 %)
2008 117 (57.9 %) 0 (0.0 %) 0 (0.0 %) 136 (65.7 %) 0 (0.0 %) 126 (63.6 %) 379 (29.1 %)
2009 3 (1.5 %) 0 (0.0 %) 0 (0.0 %) 9 (4.3 %) 0 (0.0 %) 1 (0.5 %) 13 (1.0 %)
hs_zbmi_who 0.20 (1.15) 0.19 (1.13) 0.80 (1.22) 0.52 (1.22) 0.09 (0.90) 0.68 (1.37) 0.40 (1.19)
h_bfdur_Ter
(0,10.8] 74 (36.6 %) 119 (60.1 %) 70 (31.2 %) 58 (28.0 %) 101 (37.1 %) 84 (42.4 %) 506 (38.9 %)
(10.8,34.9] 2 (1.0 %) 57 (28.8 %) 100 (44.6 %) 30 (14.5 %) 0 (0.0 %) 81 (40.9 %) 270 (20.8 %)
(34.9,Inf] 126 (62.4 %) 22 (11.1 %) 54 (24.1 %) 119 (57.5 %) 171 (62.9 %) 33 (16.7 %) 525 (40.4 %)
hs_bakery_prod_Ter
(0,2] 29 (14.4 %) 41 (20.7 %) 39 (17.4 %) 34 (16.4 %) 187 (68.8 %) 15 (7.6 %) 345 (26.5 %)
(2,6] 66 (32.7 %) 51 (25.8 %) 89 (39.7 %) 84 (40.6 %) 74 (27.2 %) 59 (29.8 %) 423 (32.5 %)
(6,Inf] 107 (53.0 %) 106 (53.5 %) 96 (42.9 %) 89 (43.0 %) 11 (4.0 %) 124 (62.6 %) 533 (41.0 %)
hs_break_cer_Ter
(0,1.1] 18 (8.9 %) 65 (32.8 %) 61 (27.2 %) 38 (18.4 %) 57 (21.0 %) 52 (26.3 %) 291 (22.4 %)
(1.1,5.5] 55 (27.2 %) 67 (33.8 %) 89 (39.7 %) 101 (48.8 %) 114 (41.9 %) 95 (48.0 %) 521 (40.0 %)
(5.5,Inf] 129 (63.9 %) 66 (33.3 %) 74 (33.0 %) 68 (32.9 %) 101 (37.1 %) 51 (25.8 %) 489 (37.6 %)
hs_dairy_Ter
(0,14.6] 21 (10.4 %) 41 (20.7 %) 55 (24.6 %) 128 (61.8 %) 76 (27.9 %) 38 (19.2 %) 359 (27.6 %)
(14.6,25.6] 86 (42.6 %) 49 (24.7 %) 99 (44.2 %) 51 (24.6 %) 91 (33.5 %) 89 (44.9 %) 465 (35.7 %)
(25.6,Inf] 95 (47.0 %) 108 (54.5 %) 70 (31.2 %) 28 (13.5 %) 105 (38.6 %) 71 (35.9 %) 477 (36.7 %)
hs_fastfood_Ter
(0,0.132] 18 (8.9 %) 23 (11.6 %) 18 (8.0 %) 51 (24.6 %) 24 (8.8 %) 9 (4.5 %) 143 (11.0 %)
(0.132,0.5] 40 (19.8 %) 101 (51.0 %) 127 (56.7 %) 106 (51.2 %) 169 (62.1 %) 60 (30.3 %) 603 (46.3 %)
(0.5,Inf] 144 (71.3 %) 74 (37.4 %) 79 (35.3 %) 50 (24.2 %) 79 (29.0 %) 129 (65.2 %) 555 (42.7 %)
hs_org_food_Ter
(0,0.132] 114 (56.4 %) 51 (25.8 %) 118 (52.7 %) 19 (9.2 %) 9 (3.3 %) 118 (59.6 %) 429 (33.0 %)
(0.132,1] 40 (19.8 %) 73 (36.9 %) 70 (31.2 %) 75 (36.2 %) 109 (40.1 %) 29 (14.6 %) 396 (30.4 %)
(1,Inf] 48 (23.8 %) 74 (37.4 %) 36 (16.1 %) 113 (54.6 %) 154 (56.6 %) 51 (25.8 %) 476 (36.6 %)
hs_proc_meat_Ter
(0,1.5] 118 (58.4 %) 47 (23.7 %) 25 (11.2 %) 83 (40.1 %) 39 (14.3 %) 54 (27.3 %) 366 (28.1 %)
(1.5,4] 32 (15.8 %) 90 (45.5 %) 85 (37.9 %) 71 (34.3 %) 85 (31.2 %) 108 (54.5 %) 471 (36.2 %)
(4,Inf] 52 (25.7 %) 61 (30.8 %) 114 (50.9 %) 53 (25.6 %) 148 (54.4 %) 36 (18.2 %) 464 (35.7 %)
hs_total_fish_Ter
(0,1.5] 82 (40.6 %) 38 (19.2 %) 25 (11.2 %) 130 (62.8 %) 38 (14.0 %) 76 (38.4 %) 389 (29.9 %)
(1.5,3] 53 (26.2 %) 103 (52.0 %) 47 (21.0 %) 57 (27.5 %) 94 (34.6 %) 100 (50.5 %) 454 (34.9 %)
(3,Inf] 67 (33.2 %) 57 (28.8 %) 152 (67.9 %) 20 (9.7 %) 140 (51.5 %) 22 (11.1 %) 458 (35.2 %)
hs_total_fruits_Ter
(0,7] 26 (12.9 %) 107 (54.0 %) 83 (37.1 %) 99 (47.8 %) 35 (12.9 %) 63 (31.8 %) 413 (31.7 %)
(7,14.1] 42 (20.8 %) 45 (22.7 %) 85 (37.9 %) 64 (30.9 %) 82 (30.1 %) 89 (44.9 %) 407 (31.3 %)
(14.1,Inf] 134 (66.3 %) 46 (23.2 %) 56 (25.0 %) 44 (21.3 %) 155 (57.0 %) 46 (23.2 %) 481 (37.0 %)
hs_total_lipids_Ter
(0,3] 18 (8.9 %) 31 (15.7 %) 151 (67.4 %) 24 (11.6 %) 32 (11.8 %) 141 (71.2 %) 397 (30.5 %)
(3,7] 72 (35.6 %) 90 (45.5 %) 40 (17.9 %) 74 (35.7 %) 82 (30.1 %) 45 (22.7 %) 403 (31.0 %)
(7,Inf] 112 (55.4 %) 77 (38.9 %) 33 (14.7 %) 109 (52.7 %) 158 (58.1 %) 12 (6.1 %) 501 (38.5 %)
hs_total_sweets_Ter
(0,4.1] 50 (24.8 %) 39 (19.7 %) 93 (41.5 %) 19 (9.2 %) 89 (32.7 %) 54 (27.3 %) 344 (26.4 %)
(4.1,8.5] 77 (38.1 %) 61 (30.8 %) 88 (39.3 %) 58 (28.0 %) 125 (46.0 %) 107 (54.0 %) 516 (39.7 %)
(8.5,Inf] 75 (37.1 %) 98 (49.5 %) 43 (19.2 %) 130 (62.8 %) 58 (21.3 %) 37 (18.7 %) 441 (33.9 %)
hs_total_veg_Ter
(0,6] 65 (32.2 %) 53 (26.8 %) 94 (42.0 %) 81 (39.1 %) 42 (15.4 %) 69 (34.8 %) 404 (31.1 %)
(6,8.5] 41 (20.3 %) 42 (21.2 %) 69 (30.8 %) 53 (25.6 %) 57 (21.0 %) 52 (26.3 %) 314 (24.1 %)
(8.5,Inf] 96 (47.5 %) 103 (52.0 %) 61 (27.2 %) 73 (35.3 %) 173 (63.6 %) 77 (38.9 %) 583 (44.8 %)
hs_cd_c_Log2 -3.87 (0.84) -4.06 (1.22) -4.22 (1.23) -4.16 (1.11) -3.60 (0.74) -3.99 (0.91) -3.97 (1.04)
hs_co_c_Log2 -2.31 (0.52) -2.38 (0.56) -2.46 (0.64) -2.37 (0.64) -2.53 (0.64) -1.93 (0.56) -2.34 (0.63)
hs_cs_c_Log2 0.12 (0.45) 1.01 (0.47) 0.61 (0.45) -0.17 (0.39) 0.71 (0.40) 0.29 (0.39) 0.44 (0.57)
hs_cu_c_Log2 9.86 (0.23) 9.88 (0.25) 9.83 (0.20) 9.80 (0.21) 9.71 (0.21) 9.93 (0.21) 9.83 (0.23)
hs_hg_c_Log2 -0.56 (1.59) 0.67 (1.29) 0.92 (1.30) -1.97 (1.49) -0.34 (1.06) -0.57 (1.69) -0.30 (1.68)
hs_mo_c_Log2 -0.13 (0.79) -0.58 (1.18) -0.55 (0.77) -0.42 (0.84) -0.17 (0.74) -0.07 (0.95) -0.32 (0.90)
hs_dde_cadj_Log2 3.81 (1.31) 4.01 (1.28) 4.36 (1.24) 5.67 (1.29) 4.26 (0.94) 6.06 (1.41) 4.67 (1.49)
hs_pcb153_cadj_Log2 2.73 (0.63) 3.50 (0.76) 3.66 (0.84) 3.93 (0.85) 4.22 (0.69) 3.03 (0.68) 3.56 (0.90)
hs_pcb170_cadj_Log2 -2.44 (3.33) 0.33 (1.89) 0.41 (2.42) -0.81 (3.58) 1.38 (1.63) -1.38 (3.14) -0.31 (3.00)
hs_dep_cadj_Log2 1.44 (3.30) -0.27 (3.31) -0.15 (3.07) -1.42 (3.25) 0.62 (2.85) 0.66 (2.82) 0.16 (3.21)
hs_pbde153_cadj_Log2 -3.39 (3.79) -5.11 (3.61) -5.05 (3.83) -4.86 (3.78) -2.66 (3.00) -6.71 (3.67) -4.53 (3.83)
hs_pfhxs_c_Log2 -1.48 (1.03) -0.51 (0.83) -1.55 (0.88) -2.69 (1.19) -0.66 (0.76) -2.83 (1.08) -1.57 (1.31)
hs_pfoa_c_Log2 0.86 (0.50) 0.56 (0.53) 0.52 (0.51) 0.42 (0.61) 0.80 (0.43) 0.46 (0.61) 0.61 (0.55)
hs_pfos_c_Log2 0.57 (0.90) 1.64 (0.78) 0.43 (0.97) 0.19 (1.29) 1.67 (0.75) 1.16 (0.88) 0.97 (1.11)
hs_prpa_cadj_Log2 -0.05 (3.69) -2.65 (3.49) 0.69 (3.83) -2.00 (3.98) -3.14 (2.92) -2.22 (3.50) -1.61 (3.82)
hs_mbzp_cadj_Log2 1.60 (1.16) 2.81 (1.19) 2.52 (1.09) 2.81 (1.11) 2.17 (1.11) 2.85 (1.23) 2.44 (1.22)
hs_mibp_cadj_Log2 6.07 (1.02) 5.47 (1.07) 4.88 (0.90) 6.27 (0.87) 4.74 (0.99) 5.63 (0.83) 5.46 (1.11)
hs_mnbp_cadj_Log2 4.74 (0.90) 4.24 (0.86) 3.99 (0.79) 5.47 (0.86) 4.79 (0.89) 4.84 (1.12) 4.68 (1.02)

All Variables of Interest

outcome_cov <- cbind(covariate_data, outcome_BMI)
outcome_cov <- outcome_cov[, !duplicated(colnames(outcome_cov))]
#the full chemicals list
chemicals_full <- c(
  "hs_as_c_Log2",
  "hs_cd_c_Log2",
  "hs_co_c_Log2",
  "hs_cs_c_Log2",
  "hs_cu_c_Log2",
  "hs_hg_c_Log2",
  "hs_mn_c_Log2",
  "hs_mo_c_Log2",
  "hs_pb_c_Log2",
  "hs_tl_cdich_None",
  "hs_dde_cadj_Log2",
  "hs_ddt_cadj_Log2",
  "hs_hcb_cadj_Log2",
  "hs_pcb118_cadj_Log2",
  "hs_pcb138_cadj_Log2",
  "hs_pcb153_cadj_Log2",
  "hs_pcb170_cadj_Log2",
  "hs_pcb180_cadj_Log2",
  "hs_dep_cadj_Log2",
  "hs_detp_cadj_Log2",
  "hs_dmdtp_cdich_None",
  "hs_dmp_cadj_Log2",
  "hs_dmtp_cadj_Log2",
  "hs_pbde153_cadj_Log2",
  "hs_pbde47_cadj_Log2",
  "hs_pfhxs_c_Log2",
  "hs_pfna_c_Log2",
  "hs_pfoa_c_Log2",
  "hs_pfos_c_Log2",
  "hs_pfunda_c_Log2",
  "hs_bpa_cadj_Log2",
  "hs_bupa_cadj_Log2",
  "hs_etpa_cadj_Log2",
  "hs_mepa_cadj_Log2",
  "hs_oxbe_cadj_Log2",
  "hs_prpa_cadj_Log2",
  "hs_trcs_cadj_Log2",
  "hs_mbzp_cadj_Log2",
  "hs_mecpp_cadj_Log2",
  "hs_mehhp_cadj_Log2",
  "hs_mehp_cadj_Log2",
  "hs_meohp_cadj_Log2",
  "hs_mep_cadj_Log2",
  "hs_mibp_cadj_Log2",
  "hs_mnbp_cadj_Log2",
  "hs_ohminp_cadj_Log2",
  "hs_oxominp_cadj_Log2",
  "hs_cotinine_cdich_None",
  "hs_globalexp2_None"
)

#postnatal diet for child
postnatal_diet <- c(
  "h_bfdur_Ter",
  "hs_bakery_prod_Ter",
  "hs_beverages_Ter",
  "hs_break_cer_Ter",
  "hs_caff_drink_Ter",
  "hs_dairy_Ter",
  "hs_fastfood_Ter",
  "hs_org_food_Ter",
  "hs_proc_meat_Ter",
  "hs_readymade_Ter",
  "hs_total_bread_Ter",
  "hs_total_cereal_Ter",
  "hs_total_fish_Ter",
  "hs_total_fruits_Ter",
  "hs_total_lipids_Ter",
  "hs_total_meat_Ter",
  "hs_total_potatoes_Ter",
  "hs_total_sweets_Ter",
  "hs_total_veg_Ter",
  "hs_total_yog_Ter"
)

all_chemicals <- exposome %>% dplyr::select(all_of(chemicals_full))

all_diet <- exposome %>% dplyr::select(all_of(postnatal_diet))

all_columns <- c(chemicals_full, postnatal_diet)
extracted_exposome <- exposome %>% dplyr::select(all_of(all_columns))

chemicals_outcome_cov <- cbind(outcome_cov, all_chemicals)

diet_outcome_cov <- cbind(outcome_cov, all_diet)

interested_data <- cbind(outcome_cov, extracted_exposome)
head(interested_data)
interested_data_corr <- select_if(interested_data, is.numeric)
# One correlation matrix suffices; cor() defaults to Pearson, and
# use = "complete.obs" applies casewise deletion of missing values
cor_matrix <- cor(interested_data_corr, method = "pearson", use = "complete.obs")
corrplot(cor_matrix, method = "color", type = "upper", tl.col = "black", tl.srt = 90, tl.cex = 0.4)

Comparing Models with and without Covariates

Baseline Covariates Model

covariates_columns <- colnames(covariate_data)
covariates_data <- interested_data %>% dplyr::select(all_of(covariates_columns))

y <- interested_data$hs_zbmi_who

set.seed(101)
trainIndex <- createDataPartition(y, p = .7, list = FALSE, times = 1)
train_data <- interested_data[trainIndex,]
test_data  <- interested_data[-trainIndex,]

x_train_cov <- model.matrix(~ . -1, data = covariates_data[trainIndex,])
x_test_cov <- model.matrix(~ . -1, data = covariates_data[-trainIndex,])

y_train <- train_data$hs_zbmi_who
y_test <- test_data$hs_zbmi_who

LASSO

fit_lasso_cov <- cv.glmnet(x_train_cov, y_train, alpha = 1, family = "gaussian")
plot(fit_lasso_cov$glmnet.fit, xvar = "lambda", main = "Lasso Coefficient Paths")  # plot the underlying fit; plotting the cv object shows the CV curve instead

best_lambda_lasso <- fit_lasso_cov$lambda.min
coef(fit_lasso_cov, s = best_lambda_lasso)
## 15 x 1 sparse Matrix of class "dgCMatrix"
##                               s1
## (Intercept)          0.238040798
## e3_sex_Nonefemale   -0.006897611
## e3_sex_Nonemale      .          
## e3_yearbir_None2004 -0.068854144
## e3_yearbir_None2005 -0.038686120
## e3_yearbir_None2006  .          
## e3_yearbir_None2007  .          
## e3_yearbir_None2008  0.017116731
## e3_yearbir_None2009  0.198910468
## h_cohort2            .          
## h_cohort3            0.472208207
## h_cohort4            0.314382789
## h_cohort5           -0.049357036
## h_cohort6            0.387237572
## hs_child_age_None    .
lasso_predictions_cov <- predict(fit_lasso_cov, s = best_lambda_lasso, newx = x_test_cov)
mse_lasso_cov <- mean((y_test - lasso_predictions_cov)^2)
rmse_lasso_cov <- sqrt(mse_lasso_cov)
cat("Lasso Test MSE (Covariates only):", mse_lasso_cov, "\n")
## Lasso Test MSE (Covariates only): 1.239095
cat("Lasso Test RMSE (Covariates only):", rmse_lasso_cov, "\n")
## Lasso Test RMSE (Covariates only): 1.113147

Ridge

fit_ridge_cov <- cv.glmnet(x_train_cov, y_train, alpha = 0, family = "gaussian")
plot(fit_ridge_cov$glmnet.fit, xvar = "lambda", main = "Ridge Coefficient Paths")

best_lambda_ridge <- fit_ridge_cov$lambda.min
coef(fit_ridge_cov, s = best_lambda_ridge)
## 15 x 1 sparse Matrix of class "dgCMatrix"
##                              s1
## (Intercept)          0.09319795
## e3_sex_Nonefemale   -0.01865528
## e3_sex_Nonemale      0.01853630
## e3_yearbir_None2004 -0.16725390
## e3_yearbir_None2005 -0.12077752
## e3_yearbir_None2006 -0.05083264
## e3_yearbir_None2007 -0.03585817
## e3_yearbir_None2008  0.02669345
## e3_yearbir_None2009  0.33967684
## h_cohort2           -0.09907039
## h_cohort3            0.43630080
## h_cohort4            0.28804145
## h_cohort5           -0.10836528
## h_cohort6            0.36307763
## hs_child_age_None    0.02739312
ridge_predictions_cov <- predict(fit_ridge_cov, s = best_lambda_ridge, newx = x_test_cov)
mse_ridge_cov <- mean((y_test - ridge_predictions_cov)^2)
rmse_ridge_cov <- sqrt(mse_ridge_cov)
cat("Ridge Test MSE (Covariates only):", mse_ridge_cov, "\n")
## Ridge Test MSE (Covariates only): 1.236435
cat("Ridge Test RMSE (Covariates only):", rmse_ridge_cov, "\n")
## Ridge Test RMSE (Covariates only): 1.111951

Elastic Net

fit_enet_cov <- cv.glmnet(x_train_cov, y_train, alpha = 0.5, family = "gaussian")
plot(fit_enet_cov$glmnet.fit, xvar = "lambda", main = "Elastic Net Coefficient Paths")

best_lambda_enet <- fit_enet_cov$lambda.min
coef(fit_enet_cov, s = best_lambda_enet)
## 15 x 1 sparse Matrix of class "dgCMatrix"
##                               s1
## (Intercept)          0.243795099
## e3_sex_Nonefemale   -0.008612618
## e3_sex_Nonemale      0.001147480
## e3_yearbir_None2004 -0.078883364
## e3_yearbir_None2005 -0.041622954
## e3_yearbir_None2006  .          
## e3_yearbir_None2007  .          
## e3_yearbir_None2008  0.019281438
## e3_yearbir_None2009  0.213563502
## h_cohort2            .          
## h_cohort3            0.466431092
## h_cohort4            0.306885448
## h_cohort5           -0.056625036
## h_cohort6            0.379711766
## hs_child_age_None    .
enet_predictions_cov <- predict(fit_enet_cov, s = best_lambda_enet, newx = x_test_cov)
mse_enet_cov <- mean((y_test - enet_predictions_cov)^2)
rmse_enet_cov <- sqrt(mse_enet_cov)
cat("Elastic Net Test MSE (Covariates only):", mse_enet_cov, "\n")
## Elastic Net Test MSE (Covariates only): 1.238834
cat("Elastic Net Test RMSE (Covariates only):", rmse_enet_cov, "\n")
## Elastic Net Test RMSE (Covariates only): 1.113029
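The three covariates-only baselines are easiest to compare side by side. This is a small sketch, assuming the `mse_*_cov` and `rmse_*_cov` objects computed in the chunks above are still in the workspace.

```r
# Collect the covariates-only test metrics into a single comparison table,
# sorted from best (lowest test MSE) to worst
baseline_comparison <- data.frame(
  model     = c("LASSO", "Ridge", "Elastic Net"),
  test_mse  = c(mse_lasso_cov, mse_ridge_cov, mse_enet_cov),
  test_rmse = c(rmse_lasso_cov, rmse_ridge_cov, rmse_enet_cov)
)
baseline_comparison[order(baseline_comparison$test_mse), ]
```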

Chemicals Data

LASSO

#LASSO train/test 70-30
set.seed(101)
train_indices <- sample(seq_len(nrow(chemicals_outcome_cov)), size = floor(0.7 * nrow(chemicals_outcome_cov)))
test_indices <- setdiff(seq_len(nrow(chemicals_outcome_cov)), train_indices)

# model.matrix() handles the factor covariates; as.matrix() on a mixed data
# frame would coerce everything to character, which glmnet cannot use
x_train <- model.matrix(~ . - 1, data = chemicals_outcome_cov[train_indices, setdiff(names(chemicals_outcome_cov), "hs_zbmi_who")])
y_train <- chemicals_outcome_cov$hs_zbmi_who[train_indices]

x_test <- model.matrix(~ . - 1, data = chemicals_outcome_cov[test_indices, setdiff(names(chemicals_outcome_cov), "hs_zbmi_who")])
y_test <- chemicals_outcome_cov$hs_zbmi_who[test_indices]

# The chemical exposure columns are numeric, so as.matrix() is safe here
x_train_chemicals_only <- as.matrix(chemicals_outcome_cov[train_indices, chemicals_full])
x_test_chemicals_only <- as.matrix(chemicals_outcome_cov[test_indices, chemicals_full])

fit_without_covariates_train <- cv.glmnet(x_train_chemicals_only, y_train, alpha = 1, family = "gaussian")
fit_without_covariates_test <- predict(fit_without_covariates_train, s = "lambda.min", newx = x_test_chemicals_only)
test_mse_without_covariates <- mean((y_test - fit_without_covariates_test)^2)

plot(fit_without_covariates_train$glmnet.fit, xvar = "lambda", main = "Coefficient Paths (Without Covariates)")

best_lambda <- fit_without_covariates_train$lambda.min  # lambda that minimizes the MSE
coef(fit_without_covariates_train, s = best_lambda)
## 50 x 1 sparse Matrix of class "dgCMatrix"
##                                   s1
## (Intercept)            -4.7797230131
## hs_as_c_Log2            .           
## hs_cd_c_Log2           -0.0238815730
## hs_co_c_Log2           -0.0011670319
## hs_cs_c_Log2            0.0771865955
## hs_cu_c_Log2            0.6071183261
## hs_hg_c_Log2           -0.0075730086
## hs_mn_c_Log2            .           
## hs_mo_c_Log2           -0.0992489424
## hs_pb_c_Log2           -0.0056257448
## hs_tl_cdich_None        .           
## hs_dde_cadj_Log2       -0.0378984008
## hs_ddt_cadj_Log2        .           
## hs_hcb_cadj_Log2        .           
## hs_pcb118_cadj_Log2     .           
## hs_pcb138_cadj_Log2     .           
## hs_pcb153_cadj_Log2    -0.1721262187
## hs_pcb170_cadj_Log2    -0.0557570999
## hs_pcb180_cadj_Log2     .           
## hs_dep_cadj_Log2       -0.0186165147
## hs_detp_cadj_Log2       .           
## hs_dmdtp_cdich_None     .           
## hs_dmp_cadj_Log2        .           
## hs_dmtp_cadj_Log2       .           
## hs_pbde153_cadj_Log2   -0.0357794002
## hs_pbde47_cadj_Log2     .           
## hs_pfhxs_c_Log2        -0.0019079468
## hs_pfna_c_Log2          .           
## hs_pfoa_c_Log2         -0.1360824261
## hs_pfos_c_Log2         -0.0478302901
## hs_pfunda_c_Log2        .           
## hs_bpa_cadj_Log2        .           
## hs_bupa_cadj_Log2       .           
## hs_etpa_cadj_Log2       .           
## hs_mepa_cadj_Log2       .           
## hs_oxbe_cadj_Log2       0.0008622765
## hs_prpa_cadj_Log2       0.0011728557
## hs_trcs_cadj_Log2       .           
## hs_mbzp_cadj_Log2       0.0373221816
## hs_mecpp_cadj_Log2      .           
## hs_mehhp_cadj_Log2      .           
## hs_mehp_cadj_Log2       .           
## hs_meohp_cadj_Log2      .           
## hs_mep_cadj_Log2        .           
## hs_mibp_cadj_Log2      -0.0477304169
## hs_mnbp_cadj_Log2      -0.0036235331
## hs_ohminp_cadj_Log2     .           
## hs_oxominp_cadj_Log2    .           
## hs_cotinine_cdich_None  .           
## hs_globalexp2_None      .
cat("Model without Covariates - Test MSE:", test_mse_without_covariates, "\n")
## Model without Covariates - Test MSE: 1.231997

Ridge

# RIDGE
fit_without_covariates_train <- cv.glmnet(x_train_chemicals_only, y_train, alpha = 0, family = "gaussian")
fit_without_covariates_test <- predict(fit_without_covariates_train, s = "lambda.min", newx = x_test_chemicals_only)
test_mse_without_covariates <- mean((y_test - fit_without_covariates_test)^2)

plot(fit_without_covariates_train$glmnet.fit, xvar = "lambda", main = "Coefficient Paths (Without Covariates)")

best_lambda <- fit_without_covariates_train$lambda.min  # lambda that minimizes the MSE
coef(fit_without_covariates_train, s = best_lambda)
## 50 x 1 sparse Matrix of class "dgCMatrix"
##                                   s1
## (Intercept)            -4.469806e+00
## hs_as_c_Log2            6.590433e-03
## hs_cd_c_Log2           -4.093355e-02
## hs_co_c_Log2           -5.049922e-02
## hs_cs_c_Log2            1.230373e-01
## hs_cu_c_Log2            6.078479e-01
## hs_hg_c_Log2           -3.225520e-02
## hs_mn_c_Log2           -3.089195e-02
## hs_mo_c_Log2           -1.068154e-01
## hs_pb_c_Log2           -5.295956e-02
## hs_tl_cdich_None        .           
## hs_dde_cadj_Log2       -4.888006e-02
## hs_ddt_cadj_Log2        4.045085e-03
## hs_hcb_cadj_Log2       -1.857150e-02
## hs_pcb118_cadj_Log2     1.400112e-02
## hs_pcb138_cadj_Log2    -3.614513e-02
## hs_pcb153_cadj_Log2    -1.223407e-01
## hs_pcb170_cadj_Log2    -5.267521e-02
## hs_pcb180_cadj_Log2    -1.074695e-02
## hs_dep_cadj_Log2       -2.548881e-02
## hs_detp_cadj_Log2       8.051621e-03
## hs_dmdtp_cdich_None     .           
## hs_dmp_cadj_Log2       -2.097690e-03
## hs_dmtp_cadj_Log2       7.300567e-05
## hs_pbde153_cadj_Log2   -3.315313e-02
## hs_pbde47_cadj_Log2     5.273953e-03
## hs_pfhxs_c_Log2        -2.966308e-02
## hs_pfna_c_Log2          2.336166e-02
## hs_pfoa_c_Log2         -1.519872e-01
## hs_pfos_c_Log2         -6.495855e-02
## hs_pfunda_c_Log2        1.248503e-02
## hs_bpa_cadj_Log2        3.832688e-04
## hs_bupa_cadj_Log2       6.588467e-03
## hs_etpa_cadj_Log2      -6.098679e-03
## hs_mepa_cadj_Log2      -1.638466e-02
## hs_oxbe_cadj_Log2       1.390524e-02
## hs_prpa_cadj_Log2       1.258510e-02
## hs_trcs_cadj_Log2       2.878805e-03
## hs_mbzp_cadj_Log2       5.550048e-02
## hs_mecpp_cadj_Log2      1.627174e-03
## hs_mehhp_cadj_Log2      2.316991e-02
## hs_mehp_cadj_Log2      -1.662304e-02
## hs_meohp_cadj_Log2      1.137436e-02
## hs_mep_cadj_Log2        3.371106e-03
## hs_mibp_cadj_Log2      -5.391219e-02
## hs_mnbp_cadj_Log2      -4.383016e-02
## hs_ohminp_cadj_Log2    -2.886768e-02
## hs_oxominp_cadj_Log2    2.204660e-02
## hs_cotinine_cdich_None  .           
## hs_globalexp2_None      .
cat("Model without Covariates - Test MSE:", test_mse_without_covariates, "\n")
## Model without Covariates - Test MSE: 1.188752

Elastic Net

# ELASTIC NET
fit_without_covariates_train <- cv.glmnet(x_train_chemicals_only, y_train, alpha = 0.5, family = "gaussian")
fit_without_covariates_test <- predict(fit_without_covariates_train, s = "lambda.min", newx = x_test_chemicals_only)
test_mse_without_covariates <- mean((y_test - fit_without_covariates_test)^2)

plot(fit_without_covariates_train$glmnet.fit, xvar = "lambda", main = "Coefficient Paths (Without Covariates)")

best_lambda <- fit_without_covariates_train$lambda.min  # lambda that minimizes the MSE
coef(fit_without_covariates_train, s = best_lambda)
## 50 x 1 sparse Matrix of class "dgCMatrix"
##                                  s1
## (Intercept)            -4.785950188
## hs_as_c_Log2            .          
## hs_cd_c_Log2           -0.025843356
## hs_co_c_Log2           -0.005835867
## hs_cs_c_Log2            0.084715330
## hs_cu_c_Log2            0.607379616
## hs_hg_c_Log2           -0.009800093
## hs_mn_c_Log2            .          
## hs_mo_c_Log2           -0.099724922
## hs_pb_c_Log2           -0.010318890
## hs_tl_cdich_None        .          
## hs_dde_cadj_Log2       -0.039528137
## hs_ddt_cadj_Log2        .          
## hs_hcb_cadj_Log2        .          
## hs_pcb118_cadj_Log2     .          
## hs_pcb138_cadj_Log2     .          
## hs_pcb153_cadj_Log2    -0.169008355
## hs_pcb170_cadj_Log2    -0.055808065
## hs_pcb180_cadj_Log2     .          
## hs_dep_cadj_Log2       -0.019034348
## hs_detp_cadj_Log2       .          
## hs_dmdtp_cdich_None     .          
## hs_dmp_cadj_Log2        .          
## hs_dmtp_cadj_Log2       .          
## hs_pbde153_cadj_Log2   -0.035464586
## hs_pbde47_cadj_Log2     .          
## hs_pfhxs_c_Log2        -0.006816020
## hs_pfna_c_Log2          .          
## hs_pfoa_c_Log2         -0.135997766
## hs_pfos_c_Log2         -0.047692264
## hs_pfunda_c_Log2        .          
## hs_bpa_cadj_Log2        .          
## hs_bupa_cadj_Log2       .          
## hs_etpa_cadj_Log2       .          
## hs_mepa_cadj_Log2       .          
## hs_oxbe_cadj_Log2       0.002529961
## hs_prpa_cadj_Log2       0.001735800
## hs_trcs_cadj_Log2       .          
## hs_mbzp_cadj_Log2       0.040317847
## hs_mecpp_cadj_Log2      .          
## hs_mehhp_cadj_Log2      .          
## hs_mehp_cadj_Log2       .          
## hs_meohp_cadj_Log2      .          
## hs_mep_cadj_Log2        .          
## hs_mibp_cadj_Log2      -0.047892677
## hs_mnbp_cadj_Log2      -0.008483913
## hs_ohminp_cadj_Log2     .          
## hs_oxominp_cadj_Log2    .          
## hs_cotinine_cdich_None  .          
## hs_globalexp2_None      .
cat("Model without Covariates - Test MSE:", test_mse_without_covariates, "\n")
## Model without Covariates - Test MSE: 1.228805
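Because the three chunks above all reuse the name `test_mse_without_covariates`, only the last value survives in the workspace. A small sketch, assuming `x_train_chemicals_only`, `y_train`, `x_test_chemicals_only`, and `y_test` from above, keeps all three chemical-only results side by side (exact values vary slightly with the cross-validation folds):

```r
library(glmnet)

# Refit each penalty over an explicit alpha grid and store one test MSE per model
alphas <- c(LASSO = 1, Ridge = 0, ElasticNet = 0.5)
chem_mse <- sapply(alphas, function(a) {
  fit <- cv.glmnet(x_train_chemicals_only, y_train, alpha = a, family = "gaussian")
  preds <- predict(fit, s = "lambda.min", newx = x_test_chemicals_only)
  mean((y_test - preds)^2)
})
chem_mse
```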
#selected chemicals that were noted in enet
chemicals_selected <- c(
  "hs_cd_c_Log2",
  "hs_co_c_Log2",
  "hs_cs_c_Log2",
  "hs_cu_c_Log2",
  "hs_hg_c_Log2",
  "hs_mo_c_Log2",
  "hs_pb_c_Log2",
  "hs_dde_cadj_Log2",
  "hs_pcb153_cadj_Log2",
  "hs_pcb170_cadj_Log2",
  "hs_dep_cadj_Log2",
  "hs_detp_cadj_Log2",
  "hs_pbde153_cadj_Log2",
  "hs_pfhxs_c_Log2",
  "hs_pfoa_c_Log2",
  "hs_pfos_c_Log2",
  "hs_mepa_cadj_Log2",
  "hs_oxbe_cadj_Log2",
  "hs_prpa_cadj_Log2",
  "hs_mbzp_cadj_Log2",
  "hs_mibp_cadj_Log2",
  "hs_mnbp_cadj_Log2")

The chemical features listed above were retained because they received nonzero coefficients in the elastic net fit.
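Rather than transcribing the nonzero rows of the coefficient printout by hand, the selected-feature vector could be built programmatically. A minimal base-R sketch, where `extract_selected` is an illustrative helper (not part of the original analysis) and a named numeric vector stands in for the sparse column returned by `coef(fit, s = "lambda.min")`:

```r
# Illustrative helper: keep the names of nonzero coefficients, dropping the
# intercept. With a real cv.glmnet fit you would first flatten the sparse
# matrix, e.g. m <- coef(fit, s = "lambda.min");
# coefs <- setNames(as.numeric(m), rownames(m)).
extract_selected <- function(coefs) {
  nonzero <- coefs[coefs != 0]
  setdiff(names(nonzero), "(Intercept)")
}

# mock of the printed coefficient column ("." entries are exact zeros)
mock_coefs <- c("(Intercept)"      =  0.5326,
                "hs_pb_c_Log2"     = -0.0103,
                "hs_ddt_cadj_Log2" =  0,
                "hs_dde_cadj_Log2" = -0.0395)
extract_selected(mock_coefs)
# -> "hs_pb_c_Log2" "hs_dde_cadj_Log2"
```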

Postnatal Diet Data

LASSO

# LASSO with train/test
set.seed(101)  
train_indices <- sample(seq_len(nrow(diet_outcome_cov)), size = floor(0.7 * nrow(diet_outcome_cov)))
test_indices <- setdiff(seq_len(nrow(diet_outcome_cov)), train_indices)

diet_data <- diet_outcome_cov[, postnatal_diet]
x_diet_train <- model.matrix(~ . + 0, data = diet_data[train_indices, ])  
x_diet_test <- model.matrix(~ . + 0, data = diet_data[test_indices, ])  

covariates <- diet_outcome_cov[, c("e3_sex_None", "e3_yearbir_None", "h_cohort", "hs_child_age_None")]
x_covariates_train <- model.matrix(~ . + 0, data = covariates[train_indices, ]) 
x_covariates_test <- model.matrix(~ . + 0, data = covariates[test_indices, ])

x_full_train <- cbind(x_diet_train, x_covariates_train)
x_full_test <- cbind(x_diet_test, x_covariates_test)

# replace any remaining NAs with 0 (simple zero-imputation so glmnet can run)
x_full_train[is.na(x_full_train)] <- 0
x_full_test[is.na(x_full_test)] <- 0
x_diet_train[is.na(x_diet_train)] <- 0
x_diet_test[is.na(x_diet_test)] <- 0

y_train <- as.numeric(diet_outcome_cov$hs_zbmi_who[train_indices])
y_test <- as.numeric(diet_outcome_cov$hs_zbmi_who[test_indices])

# fit models
fit_without_covariates <- cv.glmnet(x_diet_train, y_train, alpha = 1, family = "gaussian")
fit_without_covariates
## 
## Call:  cv.glmnet(x = x_diet_train, y = y_train, alpha = 1, family = "gaussian") 
## 
## Measure: Mean-Squared Error 
## 
##      Lambda Index Measure      SE Nonzero
## min 0.06922     9   1.431 0.06022       5
## 1se 0.14570     1   1.442 0.06160       0
plot(fit_without_covariates, main = "Cross-Validation MSE vs. Lambda (Without Covariates)")

best_lambda <- fit_without_covariates$lambda.min  # lambda that minimizes the MSE
coef(fit_without_covariates, s = best_lambda)
## 41 x 1 sparse Matrix of class "dgCMatrix"
##                                         s1
## (Intercept)                     0.53256344
## h_bfdur_Ter(0,10.8]             .         
## h_bfdur_Ter(10.8,34.9]          .         
## h_bfdur_Ter(34.9,Inf]           .         
## hs_bakery_prod_Ter(2,6]         .         
## hs_bakery_prod_Ter(6,Inf]       .         
## hs_beverages_Ter(0.132,1]       .         
## hs_beverages_Ter(1,Inf]         .         
## hs_break_cer_Ter(1.1,5.5]       .         
## hs_break_cer_Ter(5.5,Inf]       .         
## hs_caff_drink_Ter(0.132,Inf]    .         
## hs_dairy_Ter(14.6,25.6]         .         
## hs_dairy_Ter(25.6,Inf]          .         
## hs_fastfood_Ter(0.132,0.5]      .         
## hs_fastfood_Ter(0.5,Inf]        .         
## hs_org_food_Ter(0.132,1]        .         
## hs_org_food_Ter(1,Inf]         -0.13588632
## hs_proc_meat_Ter(1.5,4]         .         
## hs_proc_meat_Ter(4,Inf]         .         
## hs_readymade_Ter(0.132,0.5]     .         
## hs_readymade_Ter(0.5,Inf]       .         
## hs_total_bread_Ter(7,17.5]      .         
## hs_total_bread_Ter(17.5,Inf]    .         
## hs_total_cereal_Ter(14.1,23.6]  .         
## hs_total_cereal_Ter(23.6,Inf]   .         
## hs_total_fish_Ter(1.5,3]        .         
## hs_total_fish_Ter(3,Inf]        .         
## hs_total_fruits_Ter(7,14.1]     .         
## hs_total_fruits_Ter(14.1,Inf]  -0.02481964
## hs_total_lipids_Ter(3,7]        .         
## hs_total_lipids_Ter(7,Inf]     -0.05164312
## hs_total_meat_Ter(6,9]          .         
## hs_total_meat_Ter(9,Inf]        .         
## hs_total_potatoes_Ter(3,4]      .         
## hs_total_potatoes_Ter(4,Inf]    .         
## hs_total_sweets_Ter(4.1,8.5]   -0.01594403
## hs_total_sweets_Ter(8.5,Inf]    .         
## hs_total_veg_Ter(6,8.5]         .         
## hs_total_veg_Ter(8.5,Inf]      -0.08180563
## hs_total_yog_Ter(6,8.5]         .         
## hs_total_yog_Ter(8.5,Inf]       .
predictions_without_covariates <- predict(fit_without_covariates, s = "lambda.min", newx = x_diet_test)
mse_without_covariates <- mean((y_test - predictions_without_covariates)^2)

cat("Model without Covariates - Test MSE:", mse_without_covariates, "\n")
## Model without Covariates - Test MSE: 1.34942

Ridge

# RIDGE
fit_without_covariates <- cv.glmnet(x_diet_train, y_train, alpha = 0, family = "gaussian")
fit_without_covariates
## 
## Call:  cv.glmnet(x = x_diet_train, y = y_train, alpha = 0, family = "gaussian") 
## 
## Measure: Mean-Squared Error 
## 
##     Lambda Index Measure      SE Nonzero
## min   3.53    41   1.431 0.08497      40
## 1se 145.70     1   1.441 0.08233      40
plot(fit_without_covariates, main = "Cross-Validation MSE vs. Lambda (Without Covariates)")

best_lambda <- fit_without_covariates$lambda.min  # lambda that minimizes the MSE
coef(fit_without_covariates, s = best_lambda)
## 41 x 1 sparse Matrix of class "dgCMatrix"
##                                           s1
## (Intercept)                     0.5163069457
## h_bfdur_Ter(0,10.8]            -0.0114164662
## h_bfdur_Ter(10.8,34.9]          0.0353770607
## h_bfdur_Ter(34.9,Inf]          -0.0138651651
## hs_bakery_prod_Ter(2,6]         0.0228606785
## hs_bakery_prod_Ter(6,Inf]      -0.0268639952
## hs_beverages_Ter(0.132,1]      -0.0065939314
## hs_beverages_Ter(1,Inf]        -0.0016124215
## hs_break_cer_Ter(1.1,5.5]      -0.0034207548
## hs_break_cer_Ter(5.5,Inf]      -0.0337182186
## hs_caff_drink_Ter(0.132,Inf]   -0.0143879393
## hs_dairy_Ter(14.6,25.6]         0.0355023507
## hs_dairy_Ter(25.6,Inf]         -0.0005581647
## hs_fastfood_Ter(0.132,0.5]      0.0161761119
## hs_fastfood_Ter(0.5,Inf]       -0.0001750742
## hs_org_food_Ter(0.132,1]        0.0151677373
## hs_org_food_Ter(1,Inf]         -0.0682466785
## hs_proc_meat_Ter(1.5,4]         0.0222199344
## hs_proc_meat_Ter(4,Inf]        -0.0187135643
## hs_readymade_Ter(0.132,0.5]    -0.0013536008
## hs_readymade_Ter(0.5,Inf]       0.0105115509
## hs_total_bread_Ter(7,17.5]     -0.0035702530
## hs_total_bread_Ter(17.5,Inf]   -0.0070550360
## hs_total_cereal_Ter(14.1,23.6]  0.0082269928
## hs_total_cereal_Ter(23.6,Inf]  -0.0131001584
## hs_total_fish_Ter(1.5,3]       -0.0346609367
## hs_total_fish_Ter(3,Inf]       -0.0051749487
## hs_total_fruits_Ter(7,14.1]     0.0266413533
## hs_total_fruits_Ter(14.1,Inf]  -0.0389551124
## hs_total_lipids_Ter(3,7]       -0.0022752284
## hs_total_lipids_Ter(7,Inf]     -0.0476627593
## hs_total_meat_Ter(6,9]          0.0007524275
## hs_total_meat_Ter(9,Inf]        0.0005196923
## hs_total_potatoes_Ter(3,4]      0.0105526823
## hs_total_potatoes_Ter(4,Inf]    0.0048180175
## hs_total_sweets_Ter(4.1,8.5]   -0.0392140671
## hs_total_sweets_Ter(8.5,Inf]   -0.0010028529
## hs_total_veg_Ter(6,8.5]         0.0009962184
## hs_total_veg_Ter(8.5,Inf]      -0.0556956882
## hs_total_yog_Ter(6,8.5]        -0.0102351610
## hs_total_yog_Ter(8.5,Inf]      -0.0089303177
predictions_without_covariates <- predict(fit_without_covariates, s = "lambda.min", newx = x_diet_test)
mse_without_covariates <- mean((y_test - predictions_without_covariates)^2)

cat("Model without Covariates - Test MSE:", mse_without_covariates, "\n")
## Model without Covariates - Test MSE: 1.326308

Elastic Net

#ELASTIC NET
fit_without_covariates <- cv.glmnet(x_diet_train, y_train, alpha = 0.5, family = "gaussian")
fit_without_covariates
## 
## Call:  cv.glmnet(x = x_diet_train, y = y_train, alpha = 0.5, family = "gaussian") 
## 
## Measure: Mean-Squared Error 
## 
##      Lambda Index Measure      SE Nonzero
## min 0.07218    16   1.430 0.05641      12
## 1se 0.29139     1   1.444 0.05877       0
plot(fit_without_covariates, main = "Cross-Validation MSE vs. Lambda (Without Covariates)")

best_lambda <- fit_without_covariates$lambda.min  # lambda that minimizes the MSE
coef(fit_without_covariates, s = best_lambda)
## 41 x 1 sparse Matrix of class "dgCMatrix"
##                                          s1
## (Intercept)                     0.650606526
## h_bfdur_Ter(0,10.8]             .          
## h_bfdur_Ter(10.8,34.9]          0.039832328
## h_bfdur_Ter(34.9,Inf]           .          
## hs_bakery_prod_Ter(2,6]         .          
## hs_bakery_prod_Ter(6,Inf]      -0.052635590
## hs_beverages_Ter(0.132,1]       .          
## hs_beverages_Ter(1,Inf]         .          
## hs_break_cer_Ter(1.1,5.5]       .          
## hs_break_cer_Ter(5.5,Inf]      -0.054788470
## hs_caff_drink_Ter(0.132,Inf]    .          
## hs_dairy_Ter(14.6,25.6]         0.053455833
## hs_dairy_Ter(25.6,Inf]          .          
## hs_fastfood_Ter(0.132,0.5]      .          
## hs_fastfood_Ter(0.5,Inf]        .          
## hs_org_food_Ter(0.132,1]        .          
## hs_org_food_Ter(1,Inf]         -0.185235916
## hs_proc_meat_Ter(1.5,4]         0.008558872
## hs_proc_meat_Ter(4,Inf]         .          
## hs_readymade_Ter(0.132,0.5]     .          
## hs_readymade_Ter(0.5,Inf]       .          
## hs_total_bread_Ter(7,17.5]      .          
## hs_total_bread_Ter(17.5,Inf]    .          
## hs_total_cereal_Ter(14.1,23.6]  .          
## hs_total_cereal_Ter(23.6,Inf]   .          
## hs_total_fish_Ter(1.5,3]       -0.057540803
## hs_total_fish_Ter(3,Inf]        .          
## hs_total_fruits_Ter(7,14.1]     0.017171763
## hs_total_fruits_Ter(14.1,Inf]  -0.054914989
## hs_total_lipids_Ter(3,7]        .          
## hs_total_lipids_Ter(7,Inf]     -0.094342286
## hs_total_meat_Ter(6,9]          .          
## hs_total_meat_Ter(9,Inf]        .          
## hs_total_potatoes_Ter(3,4]      .          
## hs_total_potatoes_Ter(4,Inf]    .          
## hs_total_sweets_Ter(4.1,8.5]   -0.089860153
## hs_total_sweets_Ter(8.5,Inf]    .          
## hs_total_veg_Ter(6,8.5]         .          
## hs_total_veg_Ter(8.5,Inf]      -0.118161721
## hs_total_yog_Ter(6,8.5]         .          
## hs_total_yog_Ter(8.5,Inf]       .
predictions_without_covariates <- predict(fit_without_covariates, s = "lambda.min", newx = x_diet_test)
mse_without_covariates <- mean((y_test - predictions_without_covariates)^2)

cat("Model without Covariates - Test MSE:", mse_without_covariates, "\n")
## Model without Covariates - Test MSE: 1.335144
# diet variables selected by the elastic net (nonzero coefficients)
diet_selected <- c(
  "h_bfdur_Ter",
  "hs_bakery_prod_Ter",
  "hs_break_cer_Ter",
  "hs_dairy_Ter",
  "hs_fastfood_Ter",
  "hs_org_food_Ter",
  "hs_proc_meat_Ter",
  "hs_total_fish_Ter",
  "hs_total_fruits_Ter",
  "hs_total_lipids_Ter",
  "hs_total_sweets_Ter",
  "hs_total_veg_Ter"
)
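The coefficient labels in the printouts carry the tercile suffix that `model.matrix()` appends to factor levels (e.g. `hs_org_food_Ter(1,Inf]`). Mapping those labels back to the base variable names collected in `diet_selected` can be sketched with a regex; `base_var` is a hypothetical helper, not part of the original analysis:

```r
# Strip the trailing tercile interval "(a,b]" from dummy-coded coefficient
# labels and de-duplicate, recovering the base variable names.
base_var <- function(coef_names) {
  unique(sub("\\(.*\\]$", "", coef_names))
}

base_var(c("hs_org_food_Ter(1,Inf]",
           "hs_total_veg_Ter(8.5,Inf]",
           "hs_total_veg_Ter(6,8.5]"))
# -> "hs_org_food_Ter" "hs_total_veg_Ter"
```

Chemical variable names such as `hs_pfoa_c_Log2` contain no interval suffix and pass through unchanged.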

Combined Data (Chemicals & Postnatal Diet)

LASSO

set.seed(101)
train_indices <- sample(seq_len(nrow(interested_data)), size = floor(0.7 * nrow(interested_data)))
test_indices <- setdiff(seq_len(nrow(interested_data)), train_indices)

diet_data <- interested_data[, postnatal_diet]
x_diet_train <- model.matrix(~ . + 0, data = diet_data[train_indices, ])
x_diet_test <- model.matrix(~ . + 0, data = diet_data[test_indices, ])

chemical_data <- interested_data[, chemicals_full]
x_chemical_train <- as.matrix(chemical_data[train_indices, ])
x_chemical_test <- as.matrix(chemical_data[test_indices, ])

covariates <- interested_data[, c("e3_sex_None", "e3_yearbir_None", "h_cohort", "hs_child_age_None")]
x_covariates_train <- model.matrix(~ . + 0, data = covariates[train_indices, ])
x_covariates_test <- model.matrix(~ . + 0, data = covariates[test_indices, ])

# combine diet and chemical data with and without covariates
x_combined_train <- cbind(x_diet_train, x_chemical_train)
x_combined_test <- cbind(x_diet_test, x_chemical_test)

x_full_train <- cbind(x_combined_train, x_covariates_train)
x_full_test <- cbind(x_combined_test, x_covariates_test)

# make sure no missing values
x_full_train[is.na(x_full_train)] <- 0
x_full_test[is.na(x_full_test)] <- 0
x_combined_train[is.na(x_combined_train)] <- 0
x_combined_test[is.na(x_combined_test)] <- 0

y_train <- as.numeric(interested_data$hs_zbmi_who[train_indices])
y_test <- as.numeric(interested_data$hs_zbmi_who[test_indices])

# LASSO
fit_without_covariates <- cv.glmnet(x_combined_train, y_train, alpha = 1, family = "gaussian")
predictions_without_covariates <- predict(fit_without_covariates, s = "lambda.min", newx = x_combined_test)
mse_without_covariates <- mean((y_test - predictions_without_covariates)^2)

plot(fit_without_covariates, main = "Cross-Validation MSE vs. Lambda (Without Covariates)")

best_lambda <- fit_without_covariates$lambda.min  # lambda that minimizes the MSE
coef(fit_without_covariates, s = best_lambda)
## 90 x 1 sparse Matrix of class "dgCMatrix"
##                                          s1
## (Intercept)                    -5.016149911
## h_bfdur_Ter(0,10.8]            -0.129594522
## h_bfdur_Ter(10.8,34.9]          .          
## h_bfdur_Ter(34.9,Inf]           .          
## hs_bakery_prod_Ter(2,6]         .          
## hs_bakery_prod_Ter(6,Inf]      -0.217291423
## hs_beverages_Ter(0.132,1]       .          
## hs_beverages_Ter(1,Inf]         .          
## hs_break_cer_Ter(1.1,5.5]       .          
## hs_break_cer_Ter(5.5,Inf]       .          
## hs_caff_drink_Ter(0.132,Inf]    .          
## hs_dairy_Ter(14.6,25.6]         0.009808165
## hs_dairy_Ter(25.6,Inf]          .          
## hs_fastfood_Ter(0.132,0.5]      0.070972556
## hs_fastfood_Ter(0.5,Inf]        .          
## hs_org_food_Ter(0.132,1]        .          
## hs_org_food_Ter(1,Inf]          .          
## hs_proc_meat_Ter(1.5,4]         .          
## hs_proc_meat_Ter(4,Inf]         .          
## hs_readymade_Ter(0.132,0.5]     .          
## hs_readymade_Ter(0.5,Inf]       0.011160944
## hs_total_bread_Ter(7,17.5]     -0.010168208
## hs_total_bread_Ter(17.5,Inf]    .          
## hs_total_cereal_Ter(14.1,23.6]  .          
## hs_total_cereal_Ter(23.6,Inf]   .          
## hs_total_fish_Ter(1.5,3]       -0.024288530
## hs_total_fish_Ter(3,Inf]        .          
## hs_total_fruits_Ter(7,14.1]     .          
## hs_total_fruits_Ter(14.1,Inf]  -0.016129393
## hs_total_lipids_Ter(3,7]        .          
## hs_total_lipids_Ter(7,Inf]     -0.047350302
## hs_total_meat_Ter(6,9]          .          
## hs_total_meat_Ter(9,Inf]        .          
## hs_total_potatoes_Ter(3,4]      0.018317955
## hs_total_potatoes_Ter(4,Inf]    .          
## hs_total_sweets_Ter(4.1,8.5]   -0.006515994
## hs_total_sweets_Ter(8.5,Inf]    .          
## hs_total_veg_Ter(6,8.5]         .          
## hs_total_veg_Ter(8.5,Inf]      -0.041036632
## hs_total_yog_Ter(6,8.5]         .          
## hs_total_yog_Ter(8.5,Inf]       .          
## hs_as_c_Log2                    .          
## hs_cd_c_Log2                   -0.022337287
## hs_co_c_Log2                   -0.003616434
## hs_cs_c_Log2                    0.070483114
## hs_cu_c_Log2                    0.656568320
## hs_hg_c_Log2                   -0.012267249
## hs_mn_c_Log2                    .          
## hs_mo_c_Log2                   -0.097496432
## hs_pb_c_Log2                    .          
## hs_tl_cdich_None                .          
## hs_dde_cadj_Log2               -0.029771276
## hs_ddt_cadj_Log2                .          
## hs_hcb_cadj_Log2                .          
## hs_pcb118_cadj_Log2             .          
## hs_pcb138_cadj_Log2             .          
## hs_pcb153_cadj_Log2            -0.226942147
## hs_pcb170_cadj_Log2            -0.054403335
## hs_pcb180_cadj_Log2             .          
## hs_dep_cadj_Log2               -0.017878387
## hs_detp_cadj_Log2               .          
## hs_dmdtp_cdich_None             .          
## hs_dmp_cadj_Log2                .          
## hs_dmtp_cadj_Log2               .          
## hs_pbde153_cadj_Log2           -0.035568595
## hs_pbde47_cadj_Log2             .          
## hs_pfhxs_c_Log2                 .          
## hs_pfna_c_Log2                  .          
## hs_pfoa_c_Log2                 -0.125219198
## hs_pfos_c_Log2                 -0.047655946
## hs_pfunda_c_Log2                .          
## hs_bpa_cadj_Log2                .          
## hs_bupa_cadj_Log2               .          
## hs_etpa_cadj_Log2               .          
## hs_mepa_cadj_Log2               .          
## hs_oxbe_cadj_Log2               .          
## hs_prpa_cadj_Log2               .          
## hs_trcs_cadj_Log2               .          
## hs_mbzp_cadj_Log2               0.043689764
## hs_mecpp_cadj_Log2              .          
## hs_mehhp_cadj_Log2              .          
## hs_mehp_cadj_Log2               .          
## hs_meohp_cadj_Log2              .          
## hs_mep_cadj_Log2                .          
## hs_mibp_cadj_Log2              -0.040902710
## hs_mnbp_cadj_Log2              -0.007173325
## hs_ohminp_cadj_Log2             .          
## hs_oxominp_cadj_Log2            .          
## hs_cotinine_cdich_None          .          
## hs_globalexp2_None              .
cat("Model without Covariates - Test MSE:", mse_without_covariates, "\n")
## Model without Covariates - Test MSE: 1.200253

Ridge

# RIDGE
fit_without_covariates <- cv.glmnet(x_combined_train, y_train, alpha = 0, family = "gaussian")
predictions_without_covariates <- predict(fit_without_covariates, s = "lambda.min", newx = x_combined_test)
mse_without_covariates <- mean((y_test - predictions_without_covariates)^2)

plot(fit_without_covariates, main = "Cross-Validation MSE vs. Lambda (Without Covariates)")

best_lambda <- fit_without_covariates$lambda.min  # lambda that minimizes the MSE
coef(fit_without_covariates, s = best_lambda)
## 90 x 1 sparse Matrix of class "dgCMatrix"
##                                           s1
## (Intercept)                    -3.7486876482
## h_bfdur_Ter(0,10.8]            -0.0862270481
## h_bfdur_Ter(10.8,34.9]          0.0187498222
## h_bfdur_Ter(34.9,Inf]           0.0718972907
## hs_bakery_prod_Ter(2,6]        -0.0033853186
## hs_bakery_prod_Ter(6,Inf]      -0.1580980396
## hs_beverages_Ter(0.132,1]       0.0052318976
## hs_beverages_Ter(1,Inf]        -0.0339118523
## hs_break_cer_Ter(1.1,5.5]       0.0042988311
## hs_break_cer_Ter(5.5,Inf]      -0.0503391950
## hs_caff_drink_Ter(0.132,Inf]    0.0156001183
## hs_dairy_Ter(14.6,25.6]         0.0416574408
## hs_dairy_Ter(25.6,Inf]         -0.0174860568
## hs_fastfood_Ter(0.132,0.5]      0.0650667870
## hs_fastfood_Ter(0.5,Inf]       -0.0300919849
## hs_org_food_Ter(0.132,1]        0.0284491409
## hs_org_food_Ter(1,Inf]         -0.0490021669
## hs_proc_meat_Ter(1.5,4]         0.0055207383
## hs_proc_meat_Ter(4,Inf]        -0.0063080789
## hs_readymade_Ter(0.132,0.5]     0.0307292842
## hs_readymade_Ter(0.5,Inf]       0.0632539981
## hs_total_bread_Ter(7,17.5]     -0.0544944827
## hs_total_bread_Ter(17.5,Inf]    0.0146129335
## hs_total_cereal_Ter(14.1,23.6] -0.0004875292
## hs_total_cereal_Ter(23.6,Inf]   0.0180167268
## hs_total_fish_Ter(1.5,3]       -0.0683250014
## hs_total_fish_Ter(3,Inf]        0.0112125503
## hs_total_fruits_Ter(7,14.1]     0.0353241028
## hs_total_fruits_Ter(14.1,Inf]  -0.0433100932
## hs_total_lipids_Ter(3,7]       -0.0171427895
## hs_total_lipids_Ter(7,Inf]     -0.0848619938
## hs_total_meat_Ter(6,9]          0.0172861408
## hs_total_meat_Ter(9,Inf]        0.0044053472
## hs_total_potatoes_Ter(3,4]      0.0536415284
## hs_total_potatoes_Ter(4,Inf]   -0.0115575388
## hs_total_sweets_Ter(4.1,8.5]   -0.0692484887
## hs_total_sweets_Ter(8.5,Inf]   -0.0097071229
## hs_total_veg_Ter(6,8.5]         0.0031586461
## hs_total_veg_Ter(8.5,Inf]      -0.0567605211
## hs_total_yog_Ter(6,8.5]        -0.0245534422
## hs_total_yog_Ter(8.5,Inf]      -0.0386998840
## hs_as_c_Log2                    0.0050439215
## hs_cd_c_Log2                   -0.0352737869
## hs_co_c_Log2                   -0.0396473666
## hs_cs_c_Log2                    0.0905666600
## hs_cu_c_Log2                    0.5291861050
## hs_hg_c_Log2                   -0.0253437065
## hs_mn_c_Log2                   -0.0187832842
## hs_mo_c_Log2                   -0.0835328881
## hs_pb_c_Log2                   -0.0275390915
## hs_tl_cdich_None                .           
## hs_dde_cadj_Log2               -0.0366806354
## hs_ddt_cadj_Log2                0.0032185740
## hs_hcb_cadj_Log2               -0.0317509983
## hs_pcb118_cadj_Log2             0.0025521400
## hs_pcb138_cadj_Log2            -0.0518399321
## hs_pcb153_cadj_Log2            -0.1215197442
## hs_pcb170_cadj_Log2            -0.0418593821
## hs_pcb180_cadj_Log2            -0.0225049584
## hs_dep_cadj_Log2               -0.0189572104
## hs_detp_cadj_Log2               0.0059280868
## hs_dmdtp_cdich_None             .           
## hs_dmp_cadj_Log2               -0.0024527279
## hs_dmtp_cadj_Log2               0.0008420662
## hs_pbde153_cadj_Log2           -0.0277474044
## hs_pbde47_cadj_Log2             0.0052481134
## hs_pfhxs_c_Log2                -0.0305593699
## hs_pfna_c_Log2                 -0.0041077407
## hs_pfoa_c_Log2                 -0.1108211867
## hs_pfos_c_Log2                 -0.0475012252
## hs_pfunda_c_Log2                0.0072385180
## hs_bpa_cadj_Log2               -0.0063616978
## hs_bupa_cadj_Log2               0.0036910227
## hs_etpa_cadj_Log2              -0.0049963326
## hs_mepa_cadj_Log2              -0.0096168009
## hs_oxbe_cadj_Log2               0.0101184198
## hs_prpa_cadj_Log2               0.0061492375
## hs_trcs_cadj_Log2               0.0062007460
## hs_mbzp_cadj_Log2               0.0427566149
## hs_mecpp_cadj_Log2              0.0064702943
## hs_mehhp_cadj_Log2              0.0137458333
## hs_mehp_cadj_Log2              -0.0049569930
## hs_meohp_cadj_Log2              0.0093327409
## hs_mep_cadj_Log2                0.0064280760
## hs_mibp_cadj_Log2              -0.0385576131
## hs_mnbp_cadj_Log2              -0.0344895032
## hs_ohminp_cadj_Log2            -0.0210558232
## hs_oxominp_cadj_Log2            0.0113725933
## hs_cotinine_cdich_None          .           
## hs_globalexp2_None              .
cat("Model without Covariates - Test MSE:", mse_without_covariates, "\n")
## Model without Covariates - Test MSE: 1.154844

Elastic Net

# ELASTIC NET
fit_without_covariates <- cv.glmnet(x_combined_train, y_train, alpha = 0.5, family = "gaussian")
predictions_without_covariates <- predict(fit_without_covariates, s = "lambda.min", newx = x_combined_test)
mse_without_covariates <- mean((y_test - predictions_without_covariates)^2)

plot(fit_without_covariates, main = "Cross-Validation MSE vs. Lambda (Without Covariates)")

best_lambda <- fit_without_covariates$lambda.min  # lambda that minimizes the MSE
coef(fit_without_covariates, s = best_lambda)
## 90 x 1 sparse Matrix of class "dgCMatrix"
##                                           s1
## (Intercept)                    -5.0446731390
## h_bfdur_Ter(0,10.8]            -0.1251508851
## h_bfdur_Ter(10.8,34.9]          .           
## h_bfdur_Ter(34.9,Inf]           0.0089409662
## hs_bakery_prod_Ter(2,6]         .           
## hs_bakery_prod_Ter(6,Inf]      -0.2141179309
## hs_beverages_Ter(0.132,1]       .           
## hs_beverages_Ter(1,Inf]         .           
## hs_break_cer_Ter(1.1,5.5]       .           
## hs_break_cer_Ter(5.5,Inf]       .           
## hs_caff_drink_Ter(0.132,Inf]    .           
## hs_dairy_Ter(14.6,25.6]         0.0175630268
## hs_dairy_Ter(25.6,Inf]          .           
## hs_fastfood_Ter(0.132,0.5]      0.0739162130
## hs_fastfood_Ter(0.5,Inf]        .           
## hs_org_food_Ter(0.132,1]        .           
## hs_org_food_Ter(1,Inf]         -0.0049809964
## hs_proc_meat_Ter(1.5,4]         .           
## hs_proc_meat_Ter(4,Inf]         .           
## hs_readymade_Ter(0.132,0.5]     .           
## hs_readymade_Ter(0.5,Inf]       0.0170253867
## hs_total_bread_Ter(7,17.5]     -0.0178078486
## hs_total_bread_Ter(17.5,Inf]    .           
## hs_total_cereal_Ter(14.1,23.6]  .           
## hs_total_cereal_Ter(23.6,Inf]   .           
## hs_total_fish_Ter(1.5,3]       -0.0311197485
## hs_total_fish_Ter(3,Inf]        .           
## hs_total_fruits_Ter(7,14.1]     0.0058224545
## hs_total_fruits_Ter(14.1,Inf]  -0.0180115810
## hs_total_lipids_Ter(3,7]        .           
## hs_total_lipids_Ter(7,Inf]     -0.0529611086
## hs_total_meat_Ter(6,9]          .           
## hs_total_meat_Ter(9,Inf]        .           
## hs_total_potatoes_Ter(3,4]      0.0233163117
## hs_total_potatoes_Ter(4,Inf]    .           
## hs_total_sweets_Ter(4.1,8.5]   -0.0128007469
## hs_total_sweets_Ter(8.5,Inf]    .           
## hs_total_veg_Ter(6,8.5]         .           
## hs_total_veg_Ter(8.5,Inf]      -0.0436127333
## hs_total_yog_Ter(6,8.5]         .           
## hs_total_yog_Ter(8.5,Inf]       .           
## hs_as_c_Log2                    .           
## hs_cd_c_Log2                   -0.0242086233
## hs_co_c_Log2                   -0.0084586207
## hs_cs_c_Log2                    0.0772668783
## hs_cu_c_Log2                    0.6571106900
## hs_hg_c_Log2                   -0.0143619887
## hs_mn_c_Log2                    .           
## hs_mo_c_Log2                   -0.0974389913
## hs_pb_c_Log2                    .           
## hs_tl_cdich_None                .           
## hs_dde_cadj_Log2               -0.0321212412
## hs_ddt_cadj_Log2                .           
## hs_hcb_cadj_Log2                .           
## hs_pcb118_cadj_Log2             .           
## hs_pcb138_cadj_Log2             .           
## hs_pcb153_cadj_Log2            -0.2221589832
## hs_pcb170_cadj_Log2            -0.0546331904
## hs_pcb180_cadj_Log2             .           
## hs_dep_cadj_Log2               -0.0179999867
## hs_detp_cadj_Log2               .           
## hs_dmdtp_cdich_None             .           
## hs_dmp_cadj_Log2                .           
## hs_dmtp_cadj_Log2               .           
## hs_pbde153_cadj_Log2           -0.0351341084
## hs_pbde47_cadj_Log2             .           
## hs_pfhxs_c_Log2                -0.0055363055
## hs_pfna_c_Log2                  .           
## hs_pfoa_c_Log2                 -0.1254532888
## hs_pfos_c_Log2                 -0.0469893259
## hs_pfunda_c_Log2                .           
## hs_bpa_cadj_Log2                .           
## hs_bupa_cadj_Log2               .           
## hs_etpa_cadj_Log2               .           
## hs_mepa_cadj_Log2               .           
## hs_oxbe_cadj_Log2               .           
## hs_prpa_cadj_Log2               0.0001965683
## hs_trcs_cadj_Log2               .           
## hs_mbzp_cadj_Log2               0.0457827093
## hs_mecpp_cadj_Log2              .           
## hs_mehhp_cadj_Log2              .           
## hs_mehp_cadj_Log2               .           
## hs_meohp_cadj_Log2              .           
## hs_mep_cadj_Log2                .           
## hs_mibp_cadj_Log2              -0.0415220843
## hs_mnbp_cadj_Log2              -0.0111086286
## hs_ohminp_cadj_Log2             .           
## hs_oxominp_cadj_Log2            .           
## hs_cotinine_cdich_None          .           
## hs_globalexp2_None              .
cat("Model without Covariates - Test MSE:", mse_without_covariates, "\n")
## Model without Covariates - Test MSE: 1.198308
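For reference, the three test MSEs reported above for the combined chemical and diet predictors can be gathered into one small table (values copied from the printed output):

```r
# Combined-data test MSEs from the output above, sorted best-first.
combined_mse <- data.frame(
  model    = c("LASSO", "Ridge", "Elastic Net"),
  test_mse = c(1.200253, 1.154844, 1.198308)
)
combined_mse[order(combined_mse$test_mse), ]
# Ridge attains the lowest test MSE on the combined predictors
```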

Finalized Data

Features were selected based on the elastic net fits without covariates.

The covariates were assigned a penalty factor of zero in the lasso, ridge, and elastic net fits so their coefficients are not shrunk toward zero.
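One way to implement the freezing is glmnet's `penalty.factor` argument, where a factor of 0 exempts a column from shrinkage. Because `model.matrix()` expands factors into dummy columns (e.g. `h_cohort2`, `e3_sex_Nonemale`), covariate columns need to be flagged by name prefix rather than by exact name. A base-R sketch with illustrative column names:

```r
# Build a penalty-factor vector that leaves covariate columns unpenalized.
# The column names in `cols` are illustrative stand-ins for colnames(x_train).
covariates_selected <- c("hs_child_age_None", "h_cohort",
                         "e3_sex_None", "e3_yearbir_None")
cols <- c("hs_child_age_None", "h_cohort2", "e3_sex_Nonemale", "hs_pfoa_c_Log2")

# TRUE where a column name begins with any covariate name
is_covariate <- Reduce(`|`, lapply(covariates_selected,
                                   function(v) startsWith(cols, v)))
penalty_factors <- ifelse(is_covariate, 0, 1)  # 0 = never shrunk, 1 = penalized
penalty_factors
# -> 0 0 0 1
```

An exact-name match (`cols %in% covariates_selected`) would catch only the numeric age covariate and leave the factor dummies penalized.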

# chemicals selected by the elastic net (nonzero coefficients)
chemicals_selected <- c(
  "hs_cd_c_Log2",
  "hs_co_c_Log2",
  "hs_cs_c_Log2",
  "hs_cu_c_Log2",
  "hs_hg_c_Log2",
  "hs_mo_c_Log2",
  "hs_pb_c_Log2",
  "hs_dde_cadj_Log2",
  "hs_pcb153_cadj_Log2",
  "hs_pcb170_cadj_Log2",
  "hs_dep_cadj_Log2",
  "hs_detp_cadj_Log2",
  "hs_pbde153_cadj_Log2",
  "hs_pfhxs_c_Log2",
  "hs_pfoa_c_Log2",
  "hs_pfos_c_Log2",
  "hs_mepa_cadj_Log2",
  "hs_oxbe_cadj_Log2",
  "hs_prpa_cadj_Log2",
  "hs_mbzp_cadj_Log2",
  "hs_mibp_cadj_Log2",
  "hs_mnbp_cadj_Log2")
# diet variables selected by the elastic net (nonzero coefficients)
diet_selected <- c(
  "h_bfdur_Ter",
  "hs_bakery_prod_Ter",
  "hs_break_cer_Ter",
  "hs_dairy_Ter",
  "hs_fastfood_Ter",
  "hs_org_food_Ter",
  "hs_proc_meat_Ter",
  "hs_total_fish_Ter",
  "hs_total_fruits_Ter",
  "hs_total_lipids_Ter",
  "hs_total_sweets_Ter",
  "hs_total_veg_Ter"
)
combined_data_selected <- c(
  "h_bfdur_Ter",
  "hs_bakery_prod_Ter",
  "hs_break_cer_Ter",
  "hs_dairy_Ter",
  "hs_fastfood_Ter",
  "hs_org_food_Ter",
  "hs_proc_meat_Ter",
  "hs_total_fish_Ter",
  "hs_total_fruits_Ter",
  "hs_total_lipids_Ter",
  "hs_total_sweets_Ter",
  "hs_total_veg_Ter",
  "hs_cd_c_Log2",
  "hs_co_c_Log2",
  "hs_cs_c_Log2",
  "hs_cu_c_Log2",
  "hs_hg_c_Log2",
  "hs_mo_c_Log2",
  "hs_pb_c_Log2",
  "hs_dde_cadj_Log2",
  "hs_pcb153_cadj_Log2",
  "hs_pcb170_cadj_Log2",
  "hs_dep_cadj_Log2",
  "hs_pbde153_cadj_Log2",
  "hs_pfhxs_c_Log2",
  "hs_pfoa_c_Log2",
  "hs_pfos_c_Log2",
  "hs_prpa_cadj_Log2",
  "hs_mbzp_cadj_Log2",
  "hs_mibp_cadj_Log2",
  "hs_mnbp_cadj_Log2"
)

outcome_cov <- cbind(covariate_data, outcome_BMI)
outcome_cov <- outcome_cov[, !duplicated(colnames(outcome_cov))]

finalized_columns <- c(combined_data_selected)
final_selected_data <- exposome %>% dplyr::select(all_of(finalized_columns))

finalized_data <- cbind(outcome_cov, final_selected_data)
head(finalized_data)
numeric_finalized <- finalized_data %>%
  dplyr::select(where(is.numeric))

cor_matrix <- cor(numeric_finalized, use = "complete.obs")
corrplot(cor_matrix, method = "color", type = "upper", tl.col = "black", tl.srt = 90, tl.cex = 0.6)

find_highly_correlated <- function(cor_matrix, threshold = 0.8) {
  cor_matrix[lower.tri(cor_matrix, diag = TRUE)] <- NA   # keep each pair once
  cor_pairs <- na.omit(as.data.frame(as.table(cor_matrix)))  # long format: Var1, Var2, Freq
  cor_pairs <- cor_pairs[order(-abs(cor_pairs$Freq)), ]
  cor_pairs %>% filter(abs(Freq) > threshold)
}

highly_correlated_pairs <- find_highly_correlated(cor_matrix, threshold = 0.50)
highly_correlated_pairs
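As a quick sanity check of the upper-triangle scan above, the same idea can be run on a toy data set in which one pair of columns is constructed to be nearly collinear; `find_high_cor` is a base-R mirror of `find_highly_correlated` without the dplyr dependency:

```r
# Base-R mirror of the correlation scan: keep each pair once, melt to long
# format, sort by absolute correlation, and filter by threshold.
find_high_cor <- function(cm, threshold = 0.8) {
  cm[lower.tri(cm, diag = TRUE)] <- NA
  pairs <- na.omit(as.data.frame(as.table(cm)))  # columns: Var1, Var2, Freq
  pairs <- pairs[order(-abs(pairs$Freq)), ]
  pairs[abs(pairs$Freq) > threshold, ]
}

set.seed(1)
x <- rnorm(100)
toy <- data.frame(a = x, b = x + rnorm(100, sd = 0.1), c = rnorm(100))
pairs_found <- find_high_cor(cor(toy), threshold = 0.8)
pairs_found  # only the (a, b) pair should exceed the threshold
```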
set.seed(101)

# Splitting data into training and test sets
train_indices <- sample(seq_len(nrow(finalized_data)), size = floor(0.7 * nrow(finalized_data)))
test_indices <- setdiff(seq_len(nrow(finalized_data)), train_indices)

# Creating training and test datasets
train_data <- finalized_data[train_indices, ]
test_data <- finalized_data[test_indices, ]

# Separating predictors and outcome variable
x_train <- model.matrix(~ . + 0, data = train_data[ , !names(train_data) %in% "hs_zbmi_who"])
x_test <- model.matrix(~ . + 0, data = test_data[ , !names(test_data) %in% "hs_zbmi_who"])
y_train <- train_data$hs_zbmi_who
y_test <- test_data$hs_zbmi_who

covariates_selected <- c("hs_child_age_None", "h_cohort", "e3_sex_None", "e3_yearbir_None")

# freeze the covariates (penalty factor 0) so they are not shrunk;
# model.matrix() expands factors into dummy columns (e.g. "h_cohort2"),
# so match covariate columns by name prefix rather than by exact name
penalty_factors <- rep(1, ncol(x_train))
is_covariate <- Reduce(`|`, lapply(covariates_selected,
                                   function(v) startsWith(colnames(x_train), v)))
penalty_factors[is_covariate] <- 0

Baseline (Covariates)

covariates_selected <- c("hs_child_age_None", "h_cohort", "e3_sex_None", "e3_yearbir_None")
baseline_data <- finalized_data %>% dplyr::select(all_of(c(covariates_selected, "hs_zbmi_who")))

x <- model.matrix(~ . -1, data = baseline_data[ , !names(baseline_data) %in% "hs_zbmi_who"])
y <- baseline_data$hs_zbmi_who

set.seed(101)
trainIndex <- createDataPartition(y, p = .7, list = FALSE, times = 1)
x_train <- x[trainIndex, ]
y_train <- y[trainIndex]
x_test <- x[-trainIndex, ]
y_test <- y[-trainIndex]

penalty_factors <- rep(1, ncol(x_train))
penalty_factors[colnames(x_train) %in% covariates_selected] <- 0

LASSO

set.seed(101)
fit_lasso <- cv.glmnet(x_train, y_train, alpha = 1, family = "gaussian", penalty.factor = penalty_factors)
plot(fit_lasso)

coef(fit_lasso)
## 15 x 1 sparse Matrix of class "dgCMatrix"
##                              s1
## (Intercept)          0.87326347
## hs_child_age_None   -0.05738585
## h_cohort1            .         
## h_cohort2            .         
## h_cohort3            .         
## h_cohort4            .         
## h_cohort5            .         
## h_cohort6            .         
## e3_sex_Nonemale      .         
## e3_yearbir_None2004  .         
## e3_yearbir_None2005  .         
## e3_yearbir_None2006  .         
## e3_yearbir_None2007  .         
## e3_yearbir_None2008  .         
## e3_yearbir_None2009  .
lasso_predictions <- predict(fit_lasso, s = "lambda.min", newx = x_test)
mse_lasso <- mean((y_test - lasso_predictions)^2)
rmse_lasso <- sqrt(mse_lasso)

cat("Baseline Lasso MSE:", mse_lasso, "\n")
## Baseline Lasso MSE: 1.252768
cat("Baseline Lasso RMSE:", rmse_lasso, "\n")
## Baseline Lasso RMSE: 1.119271

Ridge

set.seed(101)
fit_ridge <- cv.glmnet(x_train, y_train, alpha = 0, family = "gaussian", penalty.factor = penalty_factors)
plot(fit_ridge)

coef(fit_ridge)
## 15 x 1 sparse Matrix of class "dgCMatrix"
##                                s1
## (Intercept)          8.732635e-01
## hs_child_age_None   -5.738585e-02
## h_cohort1           -3.471482e-37
## h_cohort2           -9.186643e-38
## h_cohort3            4.060919e-37
## h_cohort4            1.233393e-37
## h_cohort5           -2.947305e-37
## h_cohort6            1.969905e-37
## e3_sex_Nonemale      3.249059e-38
## e3_yearbir_None2004 -1.529129e-37
## e3_yearbir_None2005  2.825661e-38
## e3_yearbir_None2006 -9.749392e-39
## e3_yearbir_None2007 -5.688503e-38
## e3_yearbir_None2008  3.269518e-38
## e3_yearbir_None2009  3.782484e-37
ridge_predictions <- predict(fit_ridge, s = "lambda.min", newx = x_test)
mse_ridge <- mean((y_test - ridge_predictions)^2)
rmse_ridge <- sqrt(mse_ridge)

cat("Baseline Ridge MSE:", mse_ridge, "\n")
## Baseline Ridge MSE: 1.242846
cat("Baseline Ridge RMSE:", rmse_ridge, "\n")
## Baseline Ridge RMSE: 1.11483

Elastic Net

set.seed(101)
fit_enet <- cv.glmnet(x_train, y_train, alpha = 0.5, family = "gaussian", penalty.factor = penalty_factors)
plot(fit_enet)

coef(fit_enet)
## 15 x 1 sparse Matrix of class "dgCMatrix"
##                              s1
## (Intercept)          0.87326347
## hs_child_age_None   -0.05738585
## h_cohort1            .         
## h_cohort2            .         
## h_cohort3            .         
## h_cohort4            .         
## h_cohort5            .         
## h_cohort6            .         
## e3_sex_Nonemale      .         
## e3_yearbir_None2004  .         
## e3_yearbir_None2005  .         
## e3_yearbir_None2006  .         
## e3_yearbir_None2007  .         
## e3_yearbir_None2008  .         
## e3_yearbir_None2009  .
enet_predictions <- predict(fit_enet, s = "lambda.min", newx = x_test)
mse_enet <- mean((y_test - enet_predictions)^2)
rmse_enet <- sqrt(mse_enet)

cat("Baseline Elastic Net MSE:", mse_enet, "\n")
## Baseline Elastic Net MSE: 1.253518
cat("Baseline Elastic Net RMSE:", rmse_enet, "\n")
## Baseline Elastic Net RMSE: 1.119606

Decision Tree

set.seed(101)
trainIndex <- createDataPartition(baseline_data$hs_zbmi_who, p = .7, list = FALSE, times = 1)
train_data <- baseline_data[trainIndex, ]
test_data <- baseline_data[-trainIndex, ]

fit_tree <- rpart(hs_zbmi_who ~ ., data = train_data, method = "anova")

rpart.plot(fit_tree)

tree_predictions <- predict(fit_tree, newdata = test_data)
mse_tree <- mean((test_data$hs_zbmi_who - tree_predictions)^2)
rmse_tree <- sqrt(mse_tree)

cat("Baseline Decision Tree MSE:", mse_tree, "\n")
## Baseline Decision Tree MSE: 1.251346
cat("Baseline Decision Tree RMSE:", rmse_tree, "\n")
## Baseline Decision Tree RMSE: 1.118636

The decision tree indicates the importance of the covariates, with cohort emerging as a significant predictor.

Random Forest

fit_rf <- randomForest(hs_zbmi_who ~ ., data = train_data, ntree = 500, importance = TRUE)
varImpPlot(fit_rf)

rf_predictions <- predict(fit_rf, newdata = test_data)
mse_rf <- mean((test_data$hs_zbmi_who - rf_predictions)^2)
rmse_rf <- sqrt(mse_rf)

cat("Baseline Random Forest MSE:", mse_rf, "\n")
## Baseline Random Forest MSE: 1.285745
cat("Baseline Random Forest RMSE:", rmse_rf, "\n")
## Baseline Random Forest RMSE: 1.133907

Importance plots show that age and cohort are the most influential variables.

GBM

fit_gbm <- gbm(hs_zbmi_who ~ ., data = train_data, distribution = "gaussian", n.trees = 1000, interaction.depth = 3, shrinkage = 0.01, cv.folds = 5, verbose = FALSE)
summary(fit_gbm)
best_trees <- gbm.perf(fit_gbm, method = "cv")

gbm_predictions <- predict(fit_gbm, newdata = test_data, n.trees = best_trees)
mse_gbm <- mean((test_data$hs_zbmi_who - gbm_predictions)^2)
rmse_gbm <- sqrt(mse_gbm)

cat("Baseline GBM MSE:", mse_gbm, "\n")
## Baseline GBM MSE: 1.26256
cat("Baseline GBM RMSE:", rmse_gbm, "\n")
## Baseline GBM RMSE: 1.123637

Model Comparison:

The results indicate that Ridge Regression performs slightly better than Lasso and Elastic Net in terms of MSE and RMSE.

The Decision Tree model also performs comparably to the regularization methods with an MSE of 1.251 and RMSE of 1.119. Decision Trees are advantageous for their interpretability, providing clear decision rules based on the covariates.

The Random Forest model shows a slightly higher MSE (1.286) and RMSE (1.134) compared to the regularization methods and Decision Tree. Despite the slightly higher error metrics, Random Forest models are powerful in capturing complex interactions between variables due to the ensemble approach of multiple decision trees.

GBM performs better than Random Forest, with an MSE of 1.263 and RMSE of 1.124, indicating the effectiveness of boosting iterations in reducing error.

Age and cohort are consistently the most important variables across different models. Regularized regression models (Ridge and Elastic Net) provide a balance between prediction accuracy and model interpretability.
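For ease of comparison, the baseline test-set metrics reported above can be gathered into one table (values copied from the outputs; the RMSE column is recomputed as the square root of the MSE):

```r
# Baseline test-set metrics from the sections above, sorted best-first by MSE
baseline_results <- data.frame(
  model = c("Lasso", "Ridge", "Elastic Net", "Decision Tree",
            "Random Forest", "GBM"),
  mse   = c(1.252768, 1.242846, 1.253518, 1.251346, 1.285745, 1.262560)
)
baseline_results$rmse <- round(sqrt(baseline_results$mse), 3)
baseline_results[order(baseline_results$mse), ]
```

Sorting by MSE makes the small margin between Ridge and the tree-based models easy to see at a glance.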

Diet + Covariates

diet_selected <- c(
  "h_bfdur_Ter",
  "hs_bakery_prod_Ter",
  "hs_break_cer_Ter",
  "hs_dairy_Ter",
  "hs_fastfood_Ter",
  "hs_org_food_Ter",
  "hs_proc_meat_Ter",
  "hs_total_fish_Ter",
  "hs_total_fruits_Ter",
  "hs_total_lipids_Ter",
  "hs_total_sweets_Ter",
  "hs_total_veg_Ter"
)

covariates_selected <- c("hs_child_age_None", "h_cohort", "e3_sex_None", "e3_yearbir_None")

diet_data <- finalized_data %>% dplyr::select(c(covariates_selected, diet_selected, "hs_zbmi_who"))

set.seed(101)
trainIndex <- createDataPartition(diet_data$hs_zbmi_who, p = .7, list = FALSE, times = 1)
train_data <- diet_data[trainIndex, ]
test_data <- diet_data[-trainIndex, ]

x_train <- model.matrix(hs_zbmi_who ~ . -1, train_data)
y_train <- train_data$hs_zbmi_who
x_test <- model.matrix(hs_zbmi_who ~ . -1, test_data)
y_test <- test_data$hs_zbmi_who

penalty_factors <- rep(1, ncol(x_train))
penalty_factors[colnames(x_train) %in% covariates_selected] <- 0

LASSO

fit_lasso <- cv.glmnet(x_train, y_train, alpha = 1, family = "gaussian", penalty.factor = penalty_factors)
plot(fit_lasso)

coef(fit_lasso)
## 39 x 1 sparse Matrix of class "dgCMatrix"
##                                        s1
## (Intercept)                    0.87326347
## hs_child_age_None             -0.05738585
## h_cohort1                      .         
## h_cohort2                      .         
## h_cohort3                      .         
## h_cohort4                      .         
## h_cohort5                      .         
## h_cohort6                      .         
## e3_sex_Nonemale                .         
## e3_yearbir_None2004            .         
## e3_yearbir_None2005            .         
## e3_yearbir_None2006            .         
## e3_yearbir_None2007            .         
## e3_yearbir_None2008            .         
## e3_yearbir_None2009            .         
## h_bfdur_Ter(10.8,34.9]         .         
## h_bfdur_Ter(34.9,Inf]          .         
## hs_bakery_prod_Ter(2,6]        .         
## hs_bakery_prod_Ter(6,Inf]      .         
## hs_break_cer_Ter(1.1,5.5]      .         
## hs_break_cer_Ter(5.5,Inf]      .         
## hs_dairy_Ter(14.6,25.6]        .         
## hs_dairy_Ter(25.6,Inf]         .         
## hs_fastfood_Ter(0.132,0.5]     .         
## hs_fastfood_Ter(0.5,Inf]       .         
## hs_org_food_Ter(0.132,1]       .         
## hs_org_food_Ter(1,Inf]         .         
## hs_proc_meat_Ter(1.5,4]        .         
## hs_proc_meat_Ter(4,Inf]        .         
## hs_total_fish_Ter(1.5,3]       .         
## hs_total_fish_Ter(3,Inf]       .         
## hs_total_fruits_Ter(7,14.1]    .         
## hs_total_fruits_Ter(14.1,Inf]  .         
## hs_total_lipids_Ter(3,7]       .         
## hs_total_lipids_Ter(7,Inf]     .         
## hs_total_sweets_Ter(4.1,8.5]   .         
## hs_total_sweets_Ter(8.5,Inf]   .         
## hs_total_veg_Ter(6,8.5]        .         
## hs_total_veg_Ter(8.5,Inf]      .
lasso_predictions <- predict(fit_lasso, s = "lambda.min", newx = x_test)
mse_lasso <- mean((y_test - lasso_predictions)^2)
rmse_lasso <- sqrt(mse_lasso)

cat("Diet + Covariates Lasso MSE:", mse_lasso, "\n")
## Diet + Covariates Lasso MSE: 1.254204
cat("Diet + Covariates Lasso RMSE:", rmse_lasso, "\n")
## Diet + Covariates Lasso RMSE: 1.119912

Ridge

fit_ridge <- cv.glmnet(x_train, y_train, alpha = 0, family = "gaussian", penalty.factor = penalty_factors)
plot(fit_ridge)

coef(fit_ridge)
## 39 x 1 sparse Matrix of class "dgCMatrix"
##                                          s1
## (Intercept)                    8.732635e-01
## hs_child_age_None             -5.738585e-02
## h_cohort1                     -3.640137e-37
## h_cohort2                     -9.632957e-38
## h_cohort3                      4.258211e-37
## h_cohort4                      1.293315e-37
## h_cohort5                     -3.090494e-37
## h_cohort6                      2.065609e-37
## e3_sex_Nonemale                3.406909e-38
## e3_yearbir_None2004           -1.603419e-37
## e3_yearbir_None2005            2.962941e-38
## e3_yearbir_None2006           -1.022305e-38
## e3_yearbir_None2007           -5.964868e-38
## e3_yearbir_None2008            3.428361e-38
## e3_yearbir_None2009            3.966249e-37
## h_bfdur_Ter(10.8,34.9]         1.967500e-37
## h_bfdur_Ter(34.9,Inf]         -1.201068e-37
## hs_bakery_prod_Ter(2,6]        2.111406e-37
## hs_bakery_prod_Ter(6,Inf]     -1.904476e-37
## hs_break_cer_Ter(1.1,5.5]      9.841445e-39
## hs_break_cer_Ter(5.5,Inf]     -2.296259e-37
## hs_dairy_Ter(14.6,25.6]        5.657332e-38
## hs_dairy_Ter(25.6,Inf]         8.250493e-41
## hs_fastfood_Ter(0.132,0.5]     1.220259e-37
## hs_fastfood_Ter(0.5,Inf]      -6.260868e-38
## hs_org_food_Ter(0.132,1]       8.179714e-38
## hs_org_food_Ter(1,Inf]        -1.908079e-37
## hs_proc_meat_Ter(1.5,4]        1.198077e-37
## hs_proc_meat_Ter(4,Inf]       -2.002609e-38
## hs_total_fish_Ter(1.5,3]      -8.939327e-38
## hs_total_fish_Ter(3,Inf]       2.150559e-38
## hs_total_fruits_Ter(7,14.1]    2.176951e-37
## hs_total_fruits_Ter(14.1,Inf] -2.048991e-37
## hs_total_lipids_Ter(3,7]      -9.208028e-39
## hs_total_lipids_Ter(7,Inf]    -2.162334e-37
## hs_total_sweets_Ter(4.1,8.5]  -7.709333e-38
## hs_total_sweets_Ter(8.5,Inf]  -1.723753e-38
## hs_total_veg_Ter(6,8.5]        2.140950e-38
## hs_total_veg_Ter(8.5,Inf]     -1.287878e-37
ridge_predictions <- predict(fit_ridge, s = "lambda.min", newx = x_test)
mse_ridge <- mean((y_test - ridge_predictions)^2)
rmse_ridge <- sqrt(mse_ridge)

cat("Diet + Covariates Ridge MSE:", mse_ridge, "\n")
## Diet + Covariates Ridge MSE: 1.248868
cat("Diet + Covariates Ridge RMSE:", rmse_ridge, "\n")
## Diet + Covariates Ridge RMSE: 1.117528

Elastic Net

fit_enet <- cv.glmnet(x_train, y_train, alpha = 0.5, family = "gaussian", penalty.factor = penalty_factors)
plot(fit_enet)

coef(fit_enet)
## 39 x 1 sparse Matrix of class "dgCMatrix"
##                                        s1
## (Intercept)                    0.87326347
## hs_child_age_None             -0.05738585
## h_cohort1                      .         
## h_cohort2                      .         
## h_cohort3                      .         
## h_cohort4                      .         
## h_cohort5                      .         
## h_cohort6                      .         
## e3_sex_Nonemale                .         
## e3_yearbir_None2004            .         
## e3_yearbir_None2005            .         
## e3_yearbir_None2006            .         
## e3_yearbir_None2007            .         
## e3_yearbir_None2008            .         
## e3_yearbir_None2009            .         
## h_bfdur_Ter(10.8,34.9]         .         
## h_bfdur_Ter(34.9,Inf]          .         
## hs_bakery_prod_Ter(2,6]        .         
## hs_bakery_prod_Ter(6,Inf]      .         
## hs_break_cer_Ter(1.1,5.5]      .         
## hs_break_cer_Ter(5.5,Inf]      .         
## hs_dairy_Ter(14.6,25.6]        .         
## hs_dairy_Ter(25.6,Inf]         .         
## hs_fastfood_Ter(0.132,0.5]     .         
## hs_fastfood_Ter(0.5,Inf]       .         
## hs_org_food_Ter(0.132,1]       .         
## hs_org_food_Ter(1,Inf]         .         
## hs_proc_meat_Ter(1.5,4]        .         
## hs_proc_meat_Ter(4,Inf]        .         
## hs_total_fish_Ter(1.5,3]       .         
## hs_total_fish_Ter(3,Inf]       .         
## hs_total_fruits_Ter(7,14.1]    .         
## hs_total_fruits_Ter(14.1,Inf]  .         
## hs_total_lipids_Ter(3,7]       .         
## hs_total_lipids_Ter(7,Inf]     .         
## hs_total_sweets_Ter(4.1,8.5]   .         
## hs_total_sweets_Ter(8.5,Inf]   .         
## hs_total_veg_Ter(6,8.5]        .         
## hs_total_veg_Ter(8.5,Inf]      .
enet_predictions <- predict(fit_enet, s = "lambda.min", newx = x_test)
mse_enet <- mean((y_test - enet_predictions)^2)
rmse_enet <- sqrt(mse_enet)

cat("Diet + Covariates Elastic Net MSE:", mse_enet, "\n")
## Diet + Covariates Elastic Net MSE: 1.25324
cat("Diet + Covariates Elastic Net RMSE:", rmse_enet, "\n")
## Diet + Covariates Elastic Net RMSE: 1.119482

Ridge Regression remains the strongest of the three regularization methods here (MSE 1.249), although adding diet variables does not improve on the covariates-only Ridge model (MSE 1.243).

Decision Tree

fit_tree <- rpart(hs_zbmi_who ~ ., data = train_data, method = "anova")

rpart.plot(fit_tree)

tree_predictions <- predict(fit_tree, newdata = test_data)
mse_tree <- mean((test_data$hs_zbmi_who - tree_predictions)^2)
rmse_tree <- sqrt(mse_tree)

cat("Diet + Covariates Decision Tree MSE:", mse_tree, "\n")
## Diet + Covariates Decision Tree MSE: 1.249936
cat("Diet + Covariates Decision Tree RMSE:", rmse_tree, "\n")
## Diet + Covariates Decision Tree RMSE: 1.118005

The top predictor is h_cohort3, followed by hs_child_age_None and hs_bakery_prod_Ter. The Decision Tree model slightly improves when adding diet variables, with a decrease in both MSE (from 1.251 to 1.250) and RMSE (from 1.119 to 1.118).

Random Forest

set.seed(101)
fit_rf <- randomForest(hs_zbmi_who ~ ., data = train_data)
varImpPlot(fit_rf)

importance(fit_rf)
##                     IncNodePurity
## hs_child_age_None       208.99850
## h_cohort                 94.27094
## e3_sex_None              33.36319
## e3_yearbir_None          80.56006
## h_bfdur_Ter              57.63270
## hs_bakery_prod_Ter       64.57253
## hs_break_cer_Ter         63.66238
## hs_dairy_Ter             64.81974
## hs_fastfood_Ter          54.23841
## hs_org_food_Ter          58.37291
## hs_proc_meat_Ter         63.18441
## hs_total_fish_Ter        59.59955
## hs_total_fruits_Ter      60.14126
## hs_total_lipids_Ter      59.03292
## hs_total_sweets_Ter      64.33113
## hs_total_veg_Ter         64.78676
rf_predictions <- predict(fit_rf, newdata = test_data)
mse_rf <- mean((test_data$hs_zbmi_who - rf_predictions)^2)
rmse_rf <- sqrt(mse_rf)

cat("Diet + Covariates Random Forest MSE:", mse_rf, "\n")
## Diet + Covariates Random Forest MSE: 1.282615
cat("Diet + Covariates Random Forest RMSE:", rmse_rf, "\n")
## Diet + Covariates Random Forest RMSE: 1.132526

The Random Forest model shows a slight improvement when diet variables are included, with MSE decreasing from 1.286 to 1.283 and RMSE from 1.134 to 1.133.

GBM

set.seed(101)
fit_gbm <- gbm(hs_zbmi_who ~ ., data = train_data, distribution = "gaussian", n.trees = 1000, interaction.depth = 3, shrinkage = 0.01, cv.folds = 5, verbose = FALSE)
summary(fit_gbm)
best_trees <- gbm.perf(fit_gbm, method = "cv")

gbm_predictions <- predict(fit_gbm, newdata = test_data, n.trees = best_trees)
mse_gbm <- mean((test_data$hs_zbmi_who - gbm_predictions)^2)
rmse_gbm <- sqrt(mse_gbm)

cat("Diet + Covariates GBM MSE:", mse_gbm, "\n")
## Diet + Covariates GBM MSE: 1.252086
cat("Diet + Covariates GBM RMSE:", rmse_gbm, "\n")
## Diet + Covariates GBM RMSE: 1.118967

GBM shows a modest improvement with diet variables (MSE 1.263 to 1.252), indicating that it can leverage the additional predictors, though the gain is small.

Model Comparison:

Both the Decision Tree and Random Forest models benefit modestly from the inclusion of diet variables; these models can capture complex interactions between diet and covariates. These findings suggest that dietary factors, in conjunction with covariates, add some predictive power, justifying their inclusion in more complex modeling efforts.

Chemicals + Covariates

chemicals_selected <- c(
  "hs_cd_c_Log2", "hs_co_c_Log2", "hs_cs_c_Log2", "hs_cu_c_Log2",
  "hs_hg_c_Log2", "hs_mo_c_Log2", "hs_pb_c_Log2", "hs_dde_cadj_Log2",
  "hs_pcb153_cadj_Log2", "hs_pcb170_cadj_Log2", "hs_dep_cadj_Log2",
  "hs_pbde153_cadj_Log2", "hs_pfhxs_c_Log2", "hs_pfoa_c_Log2",
  "hs_pfos_c_Log2", "hs_prpa_cadj_Log2", "hs_mbzp_cadj_Log2", "hs_mibp_cadj_Log2",
  "hs_mnbp_cadj_Log2"
)

chemical_data <- finalized_data %>% dplyr::select(c(covariates_selected, chemicals_selected, "hs_zbmi_who"))

set.seed(101)
trainIndex <- createDataPartition(chemical_data$hs_zbmi_who, p = .7, list = FALSE, times = 1)
train_data <- chemical_data[trainIndex, ]
test_data <- chemical_data[-trainIndex, ]

x_train <- model.matrix(hs_zbmi_who ~ . -1, train_data)
y_train <- train_data$hs_zbmi_who
x_test <- model.matrix(hs_zbmi_who ~ . -1, test_data)
y_test <- test_data$hs_zbmi_who

penalty_factors <- rep(1, ncol(x_train))
penalty_factors[colnames(x_train) %in% covariates_selected] <- 0

LASSO

fit_lasso <- cv.glmnet(x_train, y_train, alpha = 1, family = "gaussian", penalty.factor = penalty_factors)
plot(fit_lasso)

coef(fit_lasso)
## 34 x 1 sparse Matrix of class "dgCMatrix"
##                               s1
## (Intercept)          -0.22880486
## hs_child_age_None    -0.06945348
## h_cohort1            -0.35586244
## h_cohort2             .         
## h_cohort3             0.15249341
## h_cohort4             .         
## h_cohort5             .         
## h_cohort6             .         
## e3_sex_Nonemale       .         
## e3_yearbir_None2004   .         
## e3_yearbir_None2005   .         
## e3_yearbir_None2006   .         
## e3_yearbir_None2007   .         
## e3_yearbir_None2008   .         
## e3_yearbir_None2009   .         
## hs_cd_c_Log2          .         
## hs_co_c_Log2          .         
## hs_cs_c_Log2          .         
## hs_cu_c_Log2          0.19017840
## hs_hg_c_Log2          .         
## hs_mo_c_Log2         -0.03891260
## hs_pb_c_Log2          .         
## hs_dde_cadj_Log2     -0.04222117
## hs_pcb153_cadj_Log2  -0.16375278
## hs_pcb170_cadj_Log2  -0.04202673
## hs_dep_cadj_Log2      .         
## hs_pbde153_cadj_Log2 -0.02572945
## hs_pfhxs_c_Log2       .         
## hs_pfoa_c_Log2       -0.01268486
## hs_pfos_c_Log2        .         
## hs_prpa_cadj_Log2     .         
## hs_mbzp_cadj_Log2     .         
## hs_mibp_cadj_Log2     .         
## hs_mnbp_cadj_Log2     .
lasso_predictions <- predict(fit_lasso, s = "lambda.min", newx = x_test)
mse_lasso <- mean((y_test - lasso_predictions)^2)
rmse_lasso <- sqrt(mse_lasso)

cat("Chemical + Covariates Lasso MSE:", mse_lasso, "\n")
## Chemical + Covariates Lasso MSE: 1.024987
cat("Chemical + Covariates Lasso RMSE:", rmse_lasso, "\n")
## Chemical + Covariates Lasso RMSE: 1.012416

Ridge

fit_ridge <- cv.glmnet(x_train, y_train, alpha = 0, family = "gaussian", penalty.factor = penalty_factors)
plot(fit_ridge)

coef(fit_ridge)
## 34 x 1 sparse Matrix of class "dgCMatrix"
##                                 s1
## (Intercept)          -1.5091761675
## hs_child_age_None    -0.0482701048
## h_cohort1            -0.2135450987
## h_cohort2            -0.0811254696
## h_cohort3             0.1431835261
## h_cohort4             0.0960498652
## h_cohort5            -0.0234630806
## h_cohort6             0.0491206037
## e3_sex_Nonemale       0.0372834068
## e3_yearbir_None2004  -0.0970866409
## e3_yearbir_None2005  -0.0052724110
## e3_yearbir_None2006   0.0224854247
## e3_yearbir_None2007  -0.0129379246
## e3_yearbir_None2008   0.0003494858
## e3_yearbir_None2009   0.2186823716
## hs_cd_c_Log2         -0.0296210057
## hs_co_c_Log2         -0.0280149777
## hs_cs_c_Log2          0.0647444145
## hs_cu_c_Log2          0.3115974897
## hs_hg_c_Log2         -0.0140060992
## hs_mo_c_Log2         -0.0639201855
## hs_pb_c_Log2         -0.0378306310
## hs_dde_cadj_Log2     -0.0559819680
## hs_pcb153_cadj_Log2  -0.1241970865
## hs_pcb170_cadj_Log2  -0.0369214916
## hs_dep_cadj_Log2     -0.0092793135
## hs_pbde153_cadj_Log2 -0.0236726456
## hs_pfhxs_c_Log2      -0.0147308317
## hs_pfoa_c_Log2       -0.0837103171
## hs_pfos_c_Log2       -0.0202621556
## hs_prpa_cadj_Log2     0.0009023231
## hs_mbzp_cadj_Log2     0.0266741033
## hs_mibp_cadj_Log2    -0.0283116127
## hs_mnbp_cadj_Log2    -0.0354752569
ridge_predictions <- predict(fit_ridge, s = "lambda.min", newx = x_test)
mse_ridge <- mean((y_test - ridge_predictions)^2)
rmse_ridge <- sqrt(mse_ridge)

cat("Chemical + Covariates Ridge MSE:", mse_ridge, "\n")
## Chemical + Covariates Ridge MSE: 1.022941
cat("Chemical + Covariates Ridge RMSE:", rmse_ridge, "\n")
## Chemical + Covariates Ridge RMSE: 1.011405

Elastic Net

fit_enet <- cv.glmnet(x_train, y_train, alpha = 0.5, family = "gaussian", penalty.factor = penalty_factors)
plot(fit_enet)

coef(fit_enet)
## 34 x 1 sparse Matrix of class "dgCMatrix"
##                               s1
## (Intercept)          -0.69372755
## hs_child_age_None    -0.07690726
## h_cohort1            -0.38803152
## h_cohort2             .         
## h_cohort3             0.17565147
## h_cohort4             .         
## h_cohort5             .         
## h_cohort6             .         
## e3_sex_Nonemale       .         
## e3_yearbir_None2004   .         
## e3_yearbir_None2005   .         
## e3_yearbir_None2006   .         
## e3_yearbir_None2007   .         
## e3_yearbir_None2008   .         
## e3_yearbir_None2009   .         
## hs_cd_c_Log2          .         
## hs_co_c_Log2          .         
## hs_cs_c_Log2          .         
## hs_cu_c_Log2          0.24840977
## hs_hg_c_Log2          .         
## hs_mo_c_Log2         -0.05153618
## hs_pb_c_Log2          .         
## hs_dde_cadj_Log2     -0.05395542
## hs_pcb153_cadj_Log2  -0.16065967
## hs_pcb170_cadj_Log2  -0.04332864
## hs_dep_cadj_Log2      .         
## hs_pbde153_cadj_Log2 -0.02668759
## hs_pfhxs_c_Log2       .         
## hs_pfoa_c_Log2       -0.03227568
## hs_pfos_c_Log2        .         
## hs_prpa_cadj_Log2     .         
## hs_mbzp_cadj_Log2     .         
## hs_mibp_cadj_Log2     .         
## hs_mnbp_cadj_Log2     .
enet_predictions <- predict(fit_enet, s = "lambda.min", newx = x_test)
mse_enet <- mean((y_test - enet_predictions)^2)
rmse_enet <- sqrt(mse_enet)

cat("Chemical + Covariates Elastic Net MSE:", mse_enet, "\n")
## Chemical + Covariates Elastic Net MSE: 1.025036
cat("Chemical + Covariates Elastic Net RMSE:", rmse_enet, "\n")
## Chemical + Covariates Elastic Net RMSE: 1.012441

Lasso, Ridge, and Elastic Net all show marked improvements (MSE drops from about 1.25 to about 1.02), indicating that the chemical variables are highly predictive.

Decision Tree

fit_tree <- rpart(hs_zbmi_who ~ ., data = train_data, method = "anova")
rpart.plot(fit_tree)

tree_predictions <- predict(fit_tree, newdata = test_data)
mse_tree <- mean((y_test - tree_predictions)^2)
rmse_tree <- sqrt(mse_tree)

cat("Chemical + Covariates Decision Tree MSE:", mse_tree, "\n")
## Chemical + Covariates Decision Tree MSE: 1.240155
cat("Chemical + Covariates Decision Tree RMSE:", rmse_tree, "\n")
## Chemical + Covariates Decision Tree RMSE: 1.113622

The Decision Tree model benefits from chemical variables, though the improvements are more modest compared to the other models.

Random Forest

fit_rf <- randomForest(hs_zbmi_who ~ ., data = train_data, ntree = 500)
rf_predictions <- predict(fit_rf, newdata = test_data)
varImpPlot(fit_rf)

importance(fit_rf)
##                      IncNodePurity
## hs_child_age_None        43.503081
## h_cohort                 55.667023
## e3_sex_None               6.437968
## e3_yearbir_None          28.303839
## hs_cd_c_Log2             50.951297
## hs_co_c_Log2             49.920695
## hs_cs_c_Log2             39.773024
## hs_cu_c_Log2             66.621641
## hs_hg_c_Log2             49.142881
## hs_mo_c_Log2             54.875936
## hs_pb_c_Log2             44.095008
## hs_dde_cadj_Log2         90.587837
## hs_pcb153_cadj_Log2      82.018781
## hs_pcb170_cadj_Log2     134.334067
## hs_dep_cadj_Log2         53.198302
## hs_pbde153_cadj_Log2     90.473791
## hs_pfhxs_c_Log2          41.857770
## hs_pfoa_c_Log2           60.276778
## hs_pfos_c_Log2           46.492002
## hs_prpa_cadj_Log2        49.081556
## hs_mbzp_cadj_Log2        50.574249
## hs_mibp_cadj_Log2        43.570481
## hs_mnbp_cadj_Log2        51.144486
mse_rf <- mean((y_test - rf_predictions)^2)
rmse_rf <- sqrt(mse_rf)

cat("Chemical + Covariates Random Forest MSE:", mse_rf, "\n")
## Chemical + Covariates Random Forest MSE: 1.025661
cat("Chemical + Covariates Random Forest RMSE:", rmse_rf, "\n")
## Chemical + Covariates Random Forest RMSE: 1.012749

GBM

# Note: only 100 trees here (earlier sections used 1000); with shrinkage = 0.01,
# cross-validation may select the maximum, suggesting more trees could help
fit_gbm <- gbm(hs_zbmi_who ~ ., data = train_data, distribution = "gaussian", n.trees = 100, interaction.depth = 3, shrinkage = 0.01, cv.folds = 5)
summary(fit_gbm)
best_iter <- gbm.perf(fit_gbm, method = "cv")

gbm_predictions <- predict(fit_gbm, newdata = test_data, n.trees = best_iter)
mse_gbm <- mean((y_test - gbm_predictions)^2)
rmse_gbm <- sqrt(mse_gbm)

cat("Chemical + Covariates GBM MSE:", mse_gbm, "\n")
## Chemical + Covariates GBM MSE: 1.128855
cat("Chemical + Covariates GBM RMSE:", rmse_gbm, "\n")
## Chemical + Covariates GBM RMSE: 1.062476

The GBM model improves substantially with the chemical variables, with MSE decreasing from 1.263 to 1.129 and RMSE from 1.124 to 1.062, showing how boosting can leverage the additional predictors.

Diet + Chemical + Covariates

covariates_selected <- c("hs_child_age_None", "h_cohort", "e3_sex_None", "e3_yearbir_None")
diet_selected <- c(
  "h_bfdur_Ter", "hs_bakery_prod_Ter", "hs_break_cer_Ter", "hs_dairy_Ter",
  "hs_fastfood_Ter", "hs_org_food_Ter", "hs_proc_meat_Ter", "hs_total_fish_Ter",
  "hs_total_fruits_Ter", "hs_total_lipids_Ter", "hs_total_sweets_Ter", "hs_total_veg_Ter"
)
chemicals_selected <- c(
  "hs_cd_c_Log2", "hs_co_c_Log2", "hs_cs_c_Log2", "hs_cu_c_Log2",
  "hs_hg_c_Log2", "hs_mo_c_Log2", "hs_pb_c_Log2", "hs_dde_cadj_Log2",
  "hs_pcb153_cadj_Log2", "hs_pcb170_cadj_Log2", "hs_dep_cadj_Log2",
  "hs_pbde153_cadj_Log2", "hs_pfhxs_c_Log2", "hs_pfoa_c_Log2",
  "hs_pfos_c_Log2", "hs_prpa_cadj_Log2", "hs_mbzp_cadj_Log2", "hs_mibp_cadj_Log2",
  "hs_mnbp_cadj_Log2"
)

combined_selected <- c(covariates_selected, diet_selected, chemicals_selected)

chemical_diet_cov_data <- finalized_data %>% dplyr::select(all_of(c(combined_selected, "hs_zbmi_who")))

set.seed(101)
trainIndex <- createDataPartition(chemical_diet_cov_data$hs_zbmi_who, p = .7, list = FALSE, times = 1)
train_data <- chemical_diet_cov_data[trainIndex,]
test_data <- chemical_diet_cov_data[-trainIndex,]

x_train <- model.matrix(hs_zbmi_who ~ . -1, train_data)
y_train <- train_data$hs_zbmi_who
x_test <- model.matrix(hs_zbmi_who ~ . -1, test_data)
y_test <- test_data$hs_zbmi_who

penalty_factors <- rep(1, ncol(x_train))
penalty_factors[colnames(x_train) %in% covariates_selected] <- 0

num_covariates <- length(covariates_selected)
num_diet <- length(diet_selected)
num_chemicals <- length(chemicals_selected)

# Build the group index vector (1 = covariates, 2 = diet, 3 = chemicals)
group_indices <- c(
  rep(1, num_covariates),     # Group 1: Covariates
  rep(2, num_diet),           # Group 2: Diet
  rep(3, num_chemicals)       # Group 3: Chemicals
)

# Pad to the model-matrix width if needed. Caveat: model.matrix() dummy-expands
# factor variables, so x_train columns no longer align one-to-one with the
# original variables; the padded group 4 is a catch-all for the extra columns.
if (length(group_indices) < ncol(x_train)) {
  group_indices <- c(group_indices, rep(4, ncol(x_train) - length(group_indices)))
}

length(group_indices) == ncol(x_train)  # should be TRUE
## [1] TRUE
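Because of the same dummy-expansion issue, the positional `group_indices` built above may not line up column-for-column with `x_train`. One alternative, sketched here on toy data rather than the fitted models above, is to assign groups by matching column-name prefixes back to the source variables:

```r
# Toy design matrix with a dummy-expanded factor in the "diet" group
toy <- data.frame(age = 1:4,
                  fruit_Ter = factor(c("lo", "hi", "lo", "hi")),
                  chem = c(0.1, 0.4, 0.2, 0.3))
X <- model.matrix(~ . + 0, data = toy)

vars_cov  <- "age"        # group 1
vars_diet <- "fruit_Ter"  # group 2; everything else falls to group 3
grp <- vapply(colnames(X), function(cn) {
  if (any(startsWith(cn, vars_cov))) 1
  else if (any(startsWith(cn, vars_diet))) 2
  else 3
}, numeric(1))
unname(grp)  # 1 2 2 3: both fruit_Ter dummies land in group 2
```

This keeps each dummy column in the group of the variable it came from, so no catch-all padding group is needed.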

Group LASSO

set.seed(101)
group_lasso_model <- grpreg(x_train, y_train, group = group_indices, penalty = "grLasso", penalty.factor = penalty_factors, family = "gaussian")

group_lasso_predictions <- predict(group_lasso_model, x_test, type = "response")

mse_group_lasso <- mean((y_test - group_lasso_predictions)^2)
rmse_group_lasso <- sqrt(mse_group_lasso)

cat("Group Lasso MSE:", mse_group_lasso, "\n")
## Group Lasso MSE: 1.041739
cat("Group Lasso RMSE:", rmse_group_lasso, "\n")
## Group Lasso RMSE: 1.020656

Group MCP

set.seed(101)
group_ridge_model <- grpreg(x_train, y_train, group = group_indices, penalty = "grMCP", penalty.factor = penalty_factors, family = "gaussian")

group_ridge_predictions <- predict(group_ridge_model, x_test, type = "response")

mse_group_ridge <- mean((y_test - group_ridge_predictions)^2)
rmse_group_ridge <- sqrt(mse_group_ridge)

cat("Group MCP MSE:", mse_group_ridge, "\n")
## Group MCP MSE: 1.036075
cat("Group MCP RMSE:", rmse_group_ridge, "\n")
## Group MCP RMSE: 1.017878

Group SCAD

set.seed(101)
group_enet_model <- grpreg(x_train, y_train, group = group_indices, penalty = "grSCAD", penalty.factor = penalty_factors, family = "gaussian")

group_enet_predictions <- predict(group_enet_model, x_test, type = "response")

mse_group_enet <- mean((y_test - group_enet_predictions)^2)
rmse_group_enet <- sqrt(mse_group_enet)

cat("Group SCAD MSE:", mse_group_enet, "\n")
## Group SCAD MSE: 1.04044
cat("Group SCAD RMSE:", rmse_group_enet, "\n")
## Group SCAD RMSE: 1.020019

The group-penalized models (group Lasso, MCP, and SCAD) all show marked improvements over the baseline (MSE ≈ 1.04 vs ≈ 1.25), suggesting that the combined diet and chemical variables substantially enhance predictive power, although none beats the plain glmnet fits on Chemicals + Covariates (MSE ≈ 1.02).

Decision Tree

set.seed(101)
fit_tree <- rpart(hs_zbmi_who ~ ., data = train_data, method = "anova")
rpart.plot(fit_tree)

tree_predictions <- predict(fit_tree, newdata = test_data)
mse_tree <- mean((y_test - tree_predictions)^2)
rmse_tree <- sqrt(mse_tree)

cat("Diet + Chemical + Covariates Decision Tree MSE:", mse_tree, "\n")
## Diet + Chemical + Covariates Decision Tree MSE: 1.240155
cat("Diet + Chemical + Covariates Decision Tree RMSE:", rmse_tree, "\n")
## Diet + Chemical + Covariates Decision Tree RMSE: 1.113622

The Decision Tree's test-set metrics are identical to those of the Chemicals + Covariates model (MSE 1.240, RMSE 1.114), suggesting the tree splits only on chemical and covariate variables and gains nothing from diet. The improvement over the baseline tree (MSE 1.251) is real but less pronounced than for the regularization methods.

Random Forest

set.seed(101)
fit_rf <- randomForest(hs_zbmi_who ~ ., data = train_data, ntree = 500)
varImpPlot(fit_rf)

importance(fit_rf)
##                      IncNodePurity
## hs_child_age_None        35.114980
## h_cohort                 50.347975
## e3_sex_None               5.813078
## e3_yearbir_None          25.549233
## h_bfdur_Ter              12.815405
## hs_bakery_prod_Ter       19.147220
## hs_break_cer_Ter         12.261038
## hs_dairy_Ter             10.800194
## hs_fastfood_Ter          10.750048
## hs_org_food_Ter           9.449158
## hs_proc_meat_Ter         10.641237
## hs_total_fish_Ter        10.840496
## hs_total_fruits_Ter      13.120566
## hs_total_lipids_Ter      10.943549
## hs_total_sweets_Ter      11.777760
## hs_total_veg_Ter         11.552907
## hs_cd_c_Log2             41.428697
## hs_co_c_Log2             42.742568
## hs_cs_c_Log2             35.785447
## hs_cu_c_Log2             55.939681
## hs_hg_c_Log2             40.514511
## hs_mo_c_Log2             49.756941
## hs_pb_c_Log2             38.766595
## hs_dde_cadj_Log2         80.839631
## hs_pcb153_cadj_Log2      76.590310
## hs_pcb170_cadj_Log2     126.930304
## hs_dep_cadj_Log2         46.803114
## hs_pbde153_cadj_Log2     84.266968
## hs_pfhxs_c_Log2          35.959135
## hs_pfoa_c_Log2           49.980790
## hs_pfos_c_Log2           44.717995
## hs_prpa_cadj_Log2        43.147837
## hs_mbzp_cadj_Log2        44.534878
## hs_mibp_cadj_Log2        34.411750
## hs_mnbp_cadj_Log2        43.971153
rf_predictions <- predict(fit_rf, newdata = test_data)

mse_rf <- mean((y_test - rf_predictions)^2)
rmse_rf <- sqrt(mse_rf)

cat("Diet + Chemical + Covariates Random Forest MSE:", mse_rf, "\n")
## Diet + Chemical + Covariates Random Forest MSE: 1.01944
cat("Diet + Chemical + Covariates Random Forest RMSE:", rmse_rf, "\n")
## Diet + Chemical + Covariates Random Forest RMSE: 1.009673

The Random Forest model shows one of the largest improvements, with MSE dropping from 1.286 to 1.019 and RMSE from 1.134 to 1.010.
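The fit above uses randomForest's default `mtry` (p/3 for regression); `tuneRF()` can search over `mtry` values by out-of-bag error. A sketch on simulated data (an assumption, not part of the original pipeline; the same call pattern applies to `train_data`):

```r
library(randomForest)

set.seed(101)
toy_x <- data.frame(matrix(rnorm(200 * 6), ncol = 6))
toy_y <- toy_x[[1]] + rnorm(200, sd = 0.5)

# search mtry values around the default, judging by OOB error
tuned <- tuneRF(x = toy_x, y = toy_y, ntreeTry = 200,
                stepFactor = 1.5, improve = 0.01,
                trace = FALSE, plot = FALSE)
tuned  # matrix of (mtry, OOB error) pairs
```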

GBM

set.seed(101)
fit_gbm <- gbm(hs_zbmi_who ~ ., data = train_data, distribution = "gaussian", n.trees = 100, interaction.depth = 3, shrinkage = 0.01, cv.folds = 5)
summary(fit_gbm)
best_iter <- gbm.perf(fit_gbm, method = "cv")

gbm_predictions <- predict(fit_gbm, newdata = test_data, n.trees = best_iter)
mse_gbm <- mean((y_test - gbm_predictions)^2)
rmse_gbm <- sqrt(mse_gbm)

cat("Diet + Chemical + Covariates GBM MSE:", mse_gbm, "\n")
## Diet + Chemical + Covariates GBM MSE: 1.125755
cat("Diet + Chemical + Covariates GBM RMSE:", rmse_gbm, "\n")
## Diet + Chemical + Covariates GBM RMSE: 1.061016

The GBM model also shows significant improvement, with MSE decreasing from 1.262 to 1.128 and RMSE from 1.124 to 1.062.

Diet + Chemical + Metabolomic + Covariates

load("/Users/allison/Library/CloudStorage/GoogleDrive-aflouie@usc.edu/My Drive/HELIX_data/metabol_serum.RData")
metabol_serum_transposed <- as.data.frame(t(metabol_serum.d))
metabol_serum_transposed$ID <- as.integer(rownames(metabol_serum_transposed))

# add the ID column to the first position
metabol_serum_transposed <- metabol_serum_transposed[, c("ID", setdiff(names(metabol_serum_transposed), "ID"))]

# ID is the first column, and the layout is preserved
kable(head(metabol_serum_transposed), align = "c", digits = 2, format = "pipe")
(table output truncated: first six rows of the transposed serum metabolomics data, IDs 430, 1187, 940, 936, 788, 698, each with 177 metabolite columns metab_1 … metab_177)
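Because `as.integer(rownames(...))` silently yields `NA` for non-numeric rownames, a quick check (an addition, not in the original pipeline) can catch a malformed ID column early:

```r
# fail fast if any rowname cannot be converted to an integer child ID
check_ids <- function(ids) {
  converted <- suppressWarnings(as.integer(ids))
  if (anyNA(converted)) {
    stop("non-numeric rownames found: ",
         paste(ids[is.na(converted)], collapse = ", "))
  }
  converted
}

check_ids(c("430", "1187", "940"))  # returns the integer IDs
```

Here `metabol_serum_transposed$ID <- check_ids(rownames(metabol_serum_transposed))` would replace the unchecked conversion above.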
# specific covariates
load("/Users/allison/Library/CloudStorage/GoogleDrive-aflouie@usc.edu/My Drive/HELIX_data/HELIX.RData")
filtered_chem_diet <- codebook %>%
  filter(domain %in% c("Chemicals", "Lifestyles") & period == "Postnatal" & subfamily != "Allergens")

# specific covariates
filtered_covariates <- codebook %>%
  filter(domain == "Covariates" & 
         variable_name %in% c("ID", "e3_sex_None", "e3_yearbir_None", "h_cohort", "hs_child_age_None"))

#specific phenotype variables
filtered_phenotype <- codebook %>%
  filter(domain == "Phenotype" & 
         variable_name %in% c("hs_zbmi_who"))

# combining all necessary variables together
combined_codebook <- bind_rows(filtered_chem_diet, filtered_covariates, filtered_phenotype)

outcome_and_cov <- cbind(covariates, outcome_BMI)
outcome_and_cov <- outcome_and_cov[, !duplicated(colnames(outcome_and_cov))]
outcome_and_cov <- outcome_and_cov %>%
  dplyr::select(ID, hs_child_age_None, h_cohort, e3_sex_None, e3_yearbir_None, hs_zbmi_who)
#the full chemicals list
chemicals_specific <- c(
  "hs_cd_c_Log2",
  "hs_co_c_Log2",
  "hs_cs_c_Log2",
  "hs_cu_c_Log2",
  "hs_hg_c_Log2",
  "hs_mo_c_Log2",
  "hs_pb_c_Log2",
  "hs_dde_cadj_Log2",
  "hs_pcb153_cadj_Log2",
  "hs_pcb170_cadj_Log2",
  "hs_dep_cadj_Log2",
  "hs_pbde153_cadj_Log2",
  "hs_pfhxs_c_Log2",
  "hs_pfoa_c_Log2",
  "hs_pfos_c_Log2",
  "hs_prpa_cadj_Log2",
  "hs_mbzp_cadj_Log2",
  "hs_mibp_cadj_Log2",
  "hs_mnbp_cadj_Log2"
)

#postnatal diet for child
postnatal_diet <- c(
  "h_bfdur_Ter",
  "hs_bakery_prod_Ter",
  "hs_dairy_Ter",
  "hs_fastfood_Ter",
  "hs_org_food_Ter",
  "hs_readymade_Ter",
  "hs_total_bread_Ter",
  "hs_total_fish_Ter",
  "hs_total_fruits_Ter",
  "hs_total_lipids_Ter",
  "hs_total_potatoes_Ter",
  "hs_total_sweets_Ter",
  "hs_total_veg_Ter"
)

covariates_selected <- c("hs_child_age_None", "h_cohort", "e3_sex_None", "e3_yearbir_None")

all_columns <- c(chemicals_specific, postnatal_diet)
extracted_exposome <- exposome %>% dplyr::select(all_of(all_columns))

selected_id_data <- cbind(outcome_and_cov, extracted_exposome)

# ID is the common identifier in both datasets
combined_data <- merge(selected_id_data, metabol_serum_transposed, by = "ID", all = TRUE)

selected_metabolomics_data <- combined_data %>% dplyr::select(-c(ID))
head(selected_metabolomics_data)
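Since `all = TRUE` keeps IDs present in only one of the two tables (those rows get NA metabolite or exposome columns and are later dropped by `na.omit()`), it can help to quantify the ID overlap first. A base-R sketch with toy ID vectors (substitute `selected_id_data$ID` and `metabol_serum_transposed$ID`):

```r
# count IDs present in both tables versus only one of them
overlap_report <- function(ids_a, ids_b) {
  c(both   = length(intersect(ids_a, ids_b)),
    only_a = length(setdiff(ids_a, ids_b)),
    only_b = length(setdiff(ids_b, ids_a)))
}

overlap_report(c(430, 1187, 940, 936), c(430, 940, 788))
```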

Group LASSO

selected_metabolomics_data <- selected_metabolomics_data %>% na.omit()

set.seed(101)
trainIndex <- createDataPartition(selected_metabolomics_data$hs_zbmi_who, p = .7, list = FALSE, times = 1)
metabol_train_data <- selected_metabolomics_data[trainIndex,]
metabol_test_data  <- selected_metabolomics_data[-trainIndex,]

x_train <- model.matrix(hs_zbmi_who ~ . -1, metabol_train_data)
y_train <- metabol_train_data$hs_zbmi_who
x_test <- model.matrix(hs_zbmi_who ~ . -1, metabol_test_data)
y_test <- metabol_test_data$hs_zbmi_who

# leave the covariates unpenalized; note that model.matrix() renames factor
# columns (appending the level name), so exact-name matching can miss
# expanded dummy columns -- check colnames(x_train) against this mask
penalty_factors <- rep(1, ncol(x_train))
penalty_factors[colnames(x_train) %in% covariates_selected] <- 0

num_covariates <- length(covariates_selected)
num_diet <- length(postnatal_diet)          # diet variables defined above
num_chemicals <- length(chemicals_specific) # chemical variables defined above
num_metabolomics <- ncol(metabol_serum_transposed) - 1  # subtract for the ID column

# NOTE: these labels must follow the column order of x_train, which (from the
# cbind above) is covariates, then chemicals, then diet, then metabolites;
# model.matrix() also expands factors into several dummy columns each, which
# the padding step below absorbs into a catch-all group
group_indices <- c(
  rep(1, num_covariates),     # Group 1: Covariates
  rep(2, num_chemicals),      # Group 2: Chemicals
  rep(3, num_diet),           # Group 3: Diet
  rep(4, num_metabolomics)    # Group 4: Metabolomics
)

# pad with a fifth group for extra dummy columns created by model.matrix()
if (length(group_indices) < ncol(x_train)) {
  group_indices <- c(group_indices, rep(5, ncol(x_train) - length(group_indices)))
}

length(group_indices) == ncol(x_train)  # This should be TRUE
## [1] TRUE
# Fit group lasso model
set.seed(101)
group_lasso_model <- grpreg(x_train, y_train, group = group_indices, penalty = "grLasso", penalty.factor = penalty_factors, family = "gaussian")

group_lasso_predictions <- predict(group_lasso_model, x_test, type = "response")

mse_group_lasso <- mean((y_test - group_lasso_predictions)^2)
rmse_group_lasso <- sqrt(mse_group_lasso)

cat("Group Lasso Test MSE:", mse_group_lasso, "\n")
## Group Lasso Test MSE: 0.8706314
cat("Group Lasso Test RMSE:", rmse_group_lasso, "\n")
## Group Lasso Test RMSE: 0.9330763
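Calling `predict()` on a `grpreg` fit without specifying `lambda` returns predictions along the entire penalty path, so the MSE above averages over all lambda values; selecting lambda by cross-validation is generally preferable. A self-contained sketch on toy data (the same `cv.grpreg`/`predict(..., lambda = cv_fit$lambda.min)` pattern applies to `x_train`/`x_test` above):

```r
library(grpreg)

set.seed(101)
X <- matrix(rnorm(150 * 6), ncol = 6)
y <- X[, 1] - X[, 4] + rnorm(150)
grp <- c(1, 1, 2, 2, 3, 3)

# cross-validate over the lambda path and predict at the CV-optimal value
cv_fit <- cv.grpreg(X, y, group = grp, penalty = "grLasso")
cv_pred <- predict(cv_fit, X = X, lambda = cv_fit$lambda.min)
mean((y - cv_pred)^2)
```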

Group MCP

The "grMCP" penalty fits the group minimax concave penalty (MCP), not ridge, so the model is labeled accordingly here.

set.seed(101)
group_mcp_model <- grpreg(x_train, y_train, group = group_indices, penalty = "grMCP", penalty.factor = penalty_factors, family = "gaussian")
group_mcp_predictions <- predict(group_mcp_model, x_test, type = "response")
mse_group_mcp <- mean((y_test - group_mcp_predictions)^2)
rmse_group_mcp <- sqrt(mse_group_mcp)
cat("Group MCP MSE:", mse_group_mcp, "\n")
## Group MCP MSE: 0.8862179
cat("Group MCP RMSE:", rmse_group_mcp, "\n")
## Group MCP RMSE: 0.9413914

Group SCAD

The "grSCAD" penalty fits the group SCAD penalty, not an elastic net, so the model is labeled accordingly here.

set.seed(101)
group_scad_model <- grpreg(x_train, y_train, group = group_indices, penalty = "grSCAD", penalty.factor = penalty_factors, family = "gaussian")
group_scad_predictions <- predict(group_scad_model, x_test, type = "response")
mse_group_scad <- mean((y_test - group_scad_predictions)^2)
rmse_group_scad <- sqrt(mse_group_scad)
cat("Group SCAD MSE:", mse_group_scad, "\n")
## Group SCAD MSE: 0.8889618
cat("Group SCAD RMSE:", rmse_group_scad, "\n")
## Group SCAD RMSE: 0.9428477
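Collecting the three grouped-penalty results side by side (values transcribed from the outputs above) makes the ranking explicit:

```r
# test-set performance of the three grpreg penalties, from the runs above
group_results <- data.frame(
  Penalty = c("grLasso", "grMCP", "grSCAD"),
  MSE  = c(0.8706, 0.8862, 0.8890),
  RMSE = c(0.9331, 0.9414, 0.9428)
)

group_results$Penalty[which.min(group_results$MSE)]  # "grLasso"
```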

Group Lasso: Shows the lowest test error among the grouped models (MSE 0.871, RMSE 0.933), demonstrating the value of the combined variables in predictive modeling.

Decision Tree

selected_metabolomics_data <- selected_metabolomics_data %>% na.omit()

set.seed(101)
trainIndex <- createDataPartition(selected_metabolomics_data$hs_zbmi_who, p = .7, 
                                  list = FALSE, 
                                  times = 1)
train_data <- selected_metabolomics_data[ trainIndex,]
test_data  <- selected_metabolomics_data[-trainIndex,]

x_train <- model.matrix(hs_zbmi_who ~ ., train_data)[,-1]
y_train <- train_data$hs_zbmi_who
x_test <- model.matrix(hs_zbmi_who ~ ., test_data)[,-1]
y_test <- test_data$hs_zbmi_who

set.seed(101)
tree_model <- rpart(hs_zbmi_who ~ ., data = train_data, method = "anova")
rpart.plot(tree_model)

tree_predictions <- predict(tree_model, newdata = test_data)
tree_mse <- mean((tree_predictions - y_test)^2)
rmse_tree <- sqrt(tree_mse)
cat("Diet + Chemical + Metabolomic + Covariates Decision Tree MSE:", tree_mse, "\n")
## Diet + Chemical + Metabolomic + Covariates Decision Tree MSE: 1.545318
cat("Diet + Chemical + Metabolomic + Covariates Decision Tree RMSE:", rmse_tree, "\n")
## Diet + Chemical + Metabolomic + Covariates Decision Tree RMSE: 1.243108

The Decision Tree model shows a higher MSE of 1.545 and RMSE of 1.243 when all variables are included, indicating it may struggle with the complexity introduced by metabolomic data.
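One common mitigation for trees facing many weak metabolite predictors (an assumption, not part of the original pipeline) is to pre-filter the `metab_*` columns by variance before fitting. A self-contained sketch with toy columns (apply the same pattern to `train_data`):

```r
set.seed(101)
toy <- data.frame(metab_1 = rnorm(50, sd = 3),
                  metab_2 = rnorm(50, sd = 0.1),
                  metab_3 = rnorm(50, sd = 1),
                  other   = rnorm(50))

# rank the metabolite columns by variance and keep the most variable ones
metab_cols <- grep("^metab_", names(toy), value = TRUE)
vars <- vapply(toy[metab_cols], var, numeric(1))
keep <- names(sort(vars, decreasing = TRUE))[1:2]  # keep the 2 most variable
keep
```

Variance filtering is unsupervised, so it avoids leaking the outcome, but it can also discard low-variance metabolites that are genuinely predictive.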

Random Forest

set.seed(101)
rf_model <- randomForest(hs_zbmi_who ~ . , data = train_data, ntree = 500)
importance(rf_model)
##                       IncNodePurity
## hs_child_age_None         4.3510904
## h_cohort                 11.4847925
## e3_sex_None               0.6469470
## e3_yearbir_None           5.0397600
## hs_cd_c_Log2              5.4034834
## hs_co_c_Log2              5.0423993
## hs_cs_c_Log2              4.6266269
## hs_cu_c_Log2             11.3629923
## hs_hg_c_Log2              5.3986631
## hs_mo_c_Log2             10.5561526
## hs_pb_c_Log2              6.0375115
## hs_dde_cadj_Log2         15.2674613
## hs_pcb153_cadj_Log2      43.7859103
## hs_pcb170_cadj_Log2      77.4592518
## hs_dep_cadj_Log2          6.9179021
## hs_pbde153_cadj_Log2     29.2165376
## hs_pfhxs_c_Log2           5.5574508
## hs_pfoa_c_Log2            9.1473793
## hs_pfos_c_Log2            6.2406329
## hs_prpa_cadj_Log2         5.5785583
## hs_mbzp_cadj_Log2         4.8381018
## hs_mibp_cadj_Log2         4.2835618
## hs_mnbp_cadj_Log2         4.0952772
## h_bfdur_Ter               3.0556126
## hs_bakery_prod_Ter        2.8734807
## hs_dairy_Ter              1.1982993
## hs_fastfood_Ter           0.7788861
## hs_org_food_Ter           1.0689860
## hs_readymade_Ter          1.7494233
## hs_total_bread_Ter        1.3698662
## hs_total_fish_Ter         1.3784976
## hs_total_fruits_Ter       1.1543863
## hs_total_lipids_Ter       1.0160675
## hs_total_potatoes_Ter     1.5864111
## hs_total_sweets_Ter       1.0535255
## hs_total_veg_Ter          1.2142266
## metab_1                   5.0246178
## metab_2                   4.9975168
## metab_3                   3.4643572
## metab_4                   5.5046649
## metab_5                   3.9463992
## metab_6                   7.3085806
## metab_7                   4.4257574
## metab_8                  31.8547065
## metab_9                   2.9959180
## metab_10                  3.1758405
## metab_11                  3.8734754
## metab_12                  3.2738085
## metab_13                  5.0382976
## metab_14                  4.8505501
## metab_15                  4.6498518
## metab_16                  2.8186537
## metab_17                  2.5814905
## metab_18                  3.4531828
## metab_19                  2.2677950
## metab_20                  3.5314013
## metab_21                  2.2608049
## metab_22                  2.5923937
## metab_23                  2.9864977
## metab_24                  3.7658293
## metab_25                  3.4717235
## metab_26                  7.5922897
## metab_27                  3.3079110
## metab_28                  3.0942746
## metab_29                  3.4194697
## metab_30                 19.0900288
## metab_31                  3.7296241
## metab_32                  2.9438450
## metab_33                  4.7884692
## metab_34                  2.5374169
## metab_35                  7.8801757
## metab_36                  3.5484238
## metab_37                  3.4615654
## metab_38                  2.8551929
## metab_39                  2.9023861
## metab_40                  5.5303117
## metab_41                  4.1589018
## metab_42                  5.7814479
## metab_43                  2.9051396
## metab_44                  3.1993167
## metab_45                  3.8627595
## metab_46                  4.7165866
## metab_47                  6.5126371
## metab_48                 11.6532289
## metab_49                 34.0040406
## metab_50                 10.2784818
## metab_51                  5.5997592
## metab_52                  3.5422826
## metab_53                  5.1768364
## metab_54                  4.5873675
## metab_55                  8.5822696
## metab_56                  3.5655906
## metab_57                  4.6641997
## metab_58                  3.3604162
## metab_59                  5.7724297
## metab_60                  4.8328602
## metab_61                  3.5217162
## metab_62                  3.6407068
## metab_63                  4.4807353
## metab_64                  4.3958412
## metab_65                  3.2690407
## metab_66                  2.6464596
## metab_67                  2.9324104
## metab_68                  3.9480657
## metab_69                  2.6555957
## metab_70                  2.6514940
## metab_71                  4.2321834
## metab_72                  3.7483895
## metab_73                  3.5601435
## metab_74                  2.5232306
## metab_75                  4.1790361
## metab_76                  2.3825409
## metab_77                  4.5203448
## metab_78                  4.2542324
## metab_79                  3.9202219
## metab_80                  3.5883304
## metab_81                  3.3478113
## metab_82                  5.2828197
## metab_83                  4.0533988
## metab_84                  3.3750068
## metab_85                  5.5406981
## metab_86                  3.5952497
## metab_87                  3.1383387
## metab_88                  2.6259304
## metab_89                  2.6560929
## metab_90                  2.6864011
## metab_91                  2.6201838
## metab_92                  3.0571047
## metab_93                  3.2157707
## metab_94                  9.3374896
## metab_95                 52.7476025
## metab_96                  7.0768415
## metab_97                  3.4158131
## metab_98                  3.4086182
## metab_99                  5.7843825
## metab_100                 3.7188542
## metab_101                 2.7941056
## metab_102                 4.9365328
## metab_103                 3.6992223
## metab_104                 4.7269214
## metab_105                 3.6214074
## metab_106                 3.6727061
## metab_107                 3.6544399
## metab_108                 3.2550617
## metab_109                 5.4854345
## metab_110                 7.4270442
## metab_111                 2.6425116
## metab_112                 2.7255616
## metab_113                 5.4538156
## metab_114                 3.3385618
## metab_115                 5.2528540
## metab_116                 4.6202248
## metab_117                 6.6989808
## metab_118                 2.5575078
## metab_119                 6.1560378
## metab_120                 7.0372080
## metab_121                 4.0063548
## metab_122                 6.9004492
## metab_123                 2.9740618
## metab_124                 3.8403649
## metab_125                 3.0350595
## metab_126                 2.5287840
## metab_127                 6.3605217
## metab_128                 6.3535024
## metab_129                 3.6497741
## metab_130                 3.5048690
## metab_131                 2.7358289
## metab_132                 3.1416537
## metab_133                 2.8089910
## metab_134                 3.8237787
## metab_135                 4.2005101
## metab_136                 5.3871803
## metab_137                 7.2446208
## metab_138                 5.6624417
## metab_139                 3.1680018
## metab_140                 3.2342640
## metab_141                 7.0874046
## metab_142                13.2308029
## metab_143                 8.1427795
## metab_144                 3.6521039
## metab_145                 3.9931242
## metab_146                 4.1109424
## metab_147                 3.7136688
## metab_148                 3.4295763
## metab_149                 5.0799668
## metab_150                 5.2297917
## metab_151                 3.6133544
## metab_152                 4.4017111
## metab_153                 4.1872432
## metab_154                 4.0488667
## metab_155                 2.6370643
## metab_156                 2.6304464
## metab_157                 3.3068935
## metab_158                 3.3021198
## metab_159                 2.8824300
## metab_160                 8.1179141
## metab_161                26.5008653
## metab_162                 3.7626315
## metab_163                16.1969248
## metab_164                 6.8100777
## metab_165                 3.6030471
## metab_166                 3.9155699
## metab_167                 3.3543066
## metab_168                 3.0729228
## metab_169                 4.0738976
## metab_170                 4.7208531
## metab_171                 4.1684339
## metab_172                 3.6712673
## metab_173                 4.0493143
## metab_174                 4.0874861
## metab_175                 4.4045373
## metab_176                 5.7419058
## metab_177                14.8456884
par(mar = c(6, 14, 4, 4) + 0.1) 
varImpPlot(rf_model, cex = 0.6)

rf_predictions <- predict(rf_model, newdata = test_data)

rf_mse <- mean((rf_predictions - y_test)^2)
rmse_rf <- sqrt(rf_mse)
cat(" Diet + Chemical + Metabolomic + Covariates Random Forest MSE:", rf_mse, "\n")
##  Diet + Chemical + Metabolomic + Covariates Random Forest MSE: 1.005087
cat("Diet + Chemical + Metabolomic + Covariates Random Forest RMSE:", rmse_rf, "\n")
## Diet + Chemical + Metabolomic + Covariates Random Forest RMSE: 1.00254

The Random Forest model continues to perform robustly (MSE 1.005, RMSE 1.003), though its improvement is less pronounced than that of the Group Lasso or GBM models.

GBM

set.seed(101)
gbm_model <- gbm(hs_zbmi_who ~ ., data = train_data, 
                 distribution = "gaussian",
                 n.trees = 1000,
                 interaction.depth = 3,
                 n.minobsinnode = 10,
                 shrinkage = 0.01,
                 cv.folds = 5,
                 verbose = FALSE)
## (per-iteration training trace omitted: training deviance fell from 1.486 at iteration 1 to 0.388 at iteration 1000, step size 0.01)
summary(gbm_model)
# finding the best number of trees based on cross-validation
best_trees <- gbm.perf(gbm_model, method = "cv")

predictions_gbm <- predict(gbm_model, test_data, n.trees = best_trees)
mse_gbm <- mean((y_test - predictions_gbm)^2)
rmse_gbm <- sqrt(mse_gbm)

cat("Diet + Chemical + Metabolomic + Covariates GBM MSE:", mse_gbm, "\n")
## Diet + Chemical + Metabolomic + Covariates GBM MSE: 0.8890887
cat("Diet + Chemical + Metabolomic + Covariates GBM RMSE:", rmse_gbm, "\n")
## Diet + Chemical + Metabolomic + Covariates GBM RMSE: 0.942915

Including diet, chemical, and metabolomic variables with the covariates leads to the best improvements in model performance, indicating these additional predictors provide substantial predictive value.

The GBM model shows the best performance among the advanced models, with MSE decreasing to 0.889 and RMSE to 0.943, highlighting its ability to effectively leverage a comprehensive set of predictors.
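Quantifying the gain from adding the metabolomic block (MSE values transcribed from the GBM outputs above):

```r
mse_dcc  <- 1.1257550  # GBM, diet + chemical + covariates
mse_full <- 0.8890887  # GBM, diet + chemical + metabolomic + covariates

# relative reduction in test MSE, in percent
round(100 * (mse_dcc - mse_full) / mse_dcc, 1)
```

The metabolomic features cut the GBM's test MSE by roughly a fifth relative to the diet + chemical model.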

Overall Comparison of MSE/RMSE

results <- data.frame(
  Model = rep(c("Lasso", "Ridge", "Elastic Net", "Decision Tree", "Random Forest", "GBM"), each = 5),
  Data_Set = rep(c("Baseline", "Diet + Covariates", "Chemicals + Covariates", "Diet + Chemicals + Covariates", "Full Model"), 6),
  MSE = runif(30, min = 0.2, max = 0.5),  # placeholder values drawn at random; substitute the MSE values computed above before interpreting the plots
  RMSE = runif(30, min = 0.4, max = 0.7)  # placeholder values drawn at random; substitute the RMSE values computed above
)

mse_plot <- ggplot(results, aes(x = Data_Set, y = MSE, fill = Model)) +
  geom_bar(stat = "identity", position = position_dodge()) +
  labs(title = "Model MSE Comparison", x = "Data Set", y = "Mean Squared Error") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

rmse_plot <- ggplot(results, aes(x = Data_Set, y = RMSE, fill = Model)) +
  geom_bar(stat = "identity", position = position_dodge()) +
  labs(title = "Model RMSE Comparison", x = "Data Set", y = "Root Mean Squared Error") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

print(mse_plot)

print(rmse_plot)